Preface



The 3rd International Semantic Web Conference (ISWC 2004) was held Novem- 
ber 7-11, 2004 in Hiroshima, Japan. If it is true what the proverb says: “Once 
by accident, twice by habit, three times by tradition,” then this third ISWC 
did indeed firmly establish a tradition. After the overwhelming interest in last 
year’s conference at Sanibel Island, Florida, this year’s conference showed that 
the Semantic Web is not just a one-day wonder, but has established itself firmly 
on the research agenda. At a time when special interest meetings with a Seman- 
tic Web theme are springing up at major conferences in numerous areas (ACL, 
VLDB, ECAI, AAAI, ECML, WWW, to name but a few), the ISWC series has 
established itself as the primary venue for Semantic Web research. 

Response to the call for papers for the conference continued to be strong. 
We solicited submissions to three tracks of the conference: the research track, 
the industrial track, and the poster track. The research track, the premier venue 
for basic research on the Semantic Web, received 205 submissions, of which 48 
were accepted for publication. Each submission was evaluated by three pro- 
gram committee members whose reviews were coordinated by members of the 
senior program committee. Final decisions were made by the program co-chairs
in consultation with the conference chair and the senior program committee. 
The industrial track, soliciting papers describing industrial research on the Se- 
mantic Web, received 22 submissions, of which 7 were accepted for publication. 
These papers were reviewed by three members of the industrial track program 
committee, and final decisions were made by the industrial track co-chairs. Fi- 
nally, the poster track, designed for late-breaking results and work in progress, 
received 68 submissions, of which 47 were accepted, following two reviews of 
poster summaries by the conference program committee. Final decisions were 
made by the poster track chair. Results of the poster track are reproduced in a 
separate volume. 

One of the Chairs’ prerogatives is to make some sense out of these submis- 
sion statistics. First of all, ISWC 2004 was a truly international forum, with 
accepted contributions from over 20 countries worldwide. It is also instructive to 
look at the distribution of papers across different areas. The chart on the next 
page shows the number of papers submitted and accepted in the different areas. 
(NB: most papers were classified in multiple areas.) We see that two of the more 
“traditional” Semantic Web topics continue to dominate the conference:
languages/tools/methodologies and ontologies. Other core topics are also strongly
represented: interoperability, Web services, middleware and searching/querying. 
Together, these six core topics already account for 60% of the accepted papers. 
Some topics, although central to the Semantic Web, are surprisingly small in 
number, for example database technologies (5%) and inference/rules (4%). Some 
other topics are already “hot” in smaller workshops, but apparently haven’t 
made it to the big conference scene yet, such as peer-to-peer and trust (each at
a modest 2%). 



[Chart: number of papers submitted and accepted per area]




Areas: 

1 Languages, tools and methodologies for Semantic Web data 

2 Ontologies (creation, merging, linking and reconciliation) 

3 Semantic integration and interoperability 

4 Semantic Web services (description, discovery, invocation, composition) 

5 Semantic Web middleware 

6 Searching, querying and viewing the Semantic Web 

7 User interfaces 

8 Visualization and modelling 

9 Data semantics 

10 Database technologies for the Semantic Web 

11 Semantic Web inference schemes/rules 

12 Tools and methodologies for Web agents 

13 Large-scale knowledge management 

14 Peer-to-peer systems 

15 Semantic Web mining 

16 Semantic Web trust, privacy, security and intellectual property rights 

17 Semantic brokering 

18 Semantic Web for e-business and e-learning 

19 Knowledge portals 



As with any conference, the quality of the accepted papers and the integrity 
of the review process reflect the hard work of the program committee. We thank 
our senior program committee, members of both program committees, and our
auxiliary reviewers for the tremendous effort they put into the task of evaluating 
submissions to the conference. Most importantly, we thank industrial track
co-chairs Dean Allemang and Jun-Ichi Akahani, and poster track chair Jeremy
Carroll, for the superb job they did organizing and coordinating their tracks of
the conference.

Invited talks constitute an integral part of the scientific program of an in- 
ternational conference. We were fortunate to have three excellent and diverse 
distinguished lectures as part of the ISWC 2004 technical program. Edward 
Feigenbaum, Kumagai Professor of Computer Science and Director Emeritus, 
Knowledge Systems Laboratory, Stanford University, communicated his views 
on the status and progress of Semantic Web research by speaking on “The Se- 
mantic Web Story - It’s already 2004. Where are we?”. Wolfgang Nejdl, Director 
of the Learning Lab Lower Saxony at the University of Hannover elaborated on 
the research issues involved in distributed search on the Semantic Web with his 
presentation entitled “How to Build Google2Google - An (Incomplete) Recipe”.

Marie-Christine Rousset, head of the Artificial Intelligence and Inference 
Systems Group in the Laboratory of Computer Science at the University of 
Paris-Sud, renewed an old knowledge representation theme by addressing the 
expressiveness/tractability trade-off in her talk entitled “Small Can be Beauti- 
ful in the Semantic Web.” This volume includes papers by Nejdl and Rousset 
that are associated with their lectures. 

In addition to the paper and poster tracks, ISWC 2004 included 8 work- 
shops, 6 tutorials, a demonstration session with 45 registered demonstrators, 
the Semantic Web Challenge with 18 participants, and an exhibition featur- 
ing industrial demonstrations. Large participation in these events reflects the 
broad interest in the Semantic Web, and how active the field is. Once again, we 
thank all the chairs for their dedicated efforts towards making the conference a 
success. The local organization of ISWC 2004 went smoothly through the ex- 
traordinary care and attention of those on the organizing committee. We are 
greatly indebted to Riichiro Mizoguchi, the local arrangements chair, for doing
a meticulous job. His attention to detail, and the beautiful venue he selected for 
the conference contributed tremendously to the overall experience of the
conference. John Mylopoulos and Katia Sycara, ISWC 2003 program co-chairs, also
deserve our special thanks for their guidance and for sharing their experience 
with last year’s conference. 

Electronic submission of papers and reviews was driven by the Confious 
Conference Management system, developed at ICS FORTH by Manos Papagge- 
lis. This was the first time Confious was used, and the system ran remarkably 
smoothly. We thank Manos for his around-the-clock support during the many 
months preceding the conference. We also owe a debt of gratitude to Akiko In- 
aba for her diligence and aesthetic sense in developing and supporting the ISWC 
2004 Web page. Finally, we extend tremendous thanks to Jorge Baier for the fine 
job he did in preparing these proceedings. 
We hope that the attendees found the conference both stimulating and
enjoyable.

November, 2004 Sheila McIlraith and Dimitris Plexousakis

Program Co-Chairs 

Frank van Harmelen 
Conference Chair 
Abstract. This talk explores aspects relevant for peer-to-peer search
infrastructures, which we think are better suited to semantic web search than centralized
approaches. It does so in the form of an (incomplete) cookbook recipe, listing 
necessary ingredients for putting together a distributed search infrastructure. The 
reader has to be aware, though, that many of these ingredients are research
questions rather than solutions, and that quite a few more research papers on
these aspects are needed before we can really cook and serve the final infrastructure. We’ll
include appropriate references as examples for the aspects discussed (with some 
bias to our own work at L3S), though a complete literature overview would go 
well beyond cookbook recipe length limits. 



1 Introduction 

Why should we even think about building more powerful peer-to-peer search infrastruc- 
tures? Isn’t Google sufficient for every purpose we can imagine? Well, there are some 
areas where a centralized search engine cannot or might not want to go, for example the 
hidden web or community-driven search. 

Covering the hidden web would require replicating data that is usually stored in databases
in a central search engine, which seems like a bad idea, even though Froogle
(http://froogle.google.com/) attempts to do this for a limited domain / purpose (shopping, 
of course). A central data warehouse just does not seem an appropriate infrastructure 
for the world wide web, even though replicating much of the surface web on the Google 
cluster is (still) doable. 

The main characteristic of a community-driven search infrastructure is its strong 
search bias on specific topics / results. While the last two years have seen techniques 
emerging for biasing and personalizing search [1,2,3], catering for a lot of small and 
specific search communities in a centralized search engine seems neither easy to
accomplish nor particularly useful to implement for a search engine company which has to
target the average user, not specialized communities. 

Besides covering these new areas, another advantage of a distributed search in- 
frastructure is the potential for faster updates of indices, because we can exploit local 
knowledge about new and updated content directly at the site which provides it, without 
necessarily crawling all of its content again. Last, but not least, decentralized search 
services might also appeal to those who would rather opt for a more “democratic” and 
decentralized search service infrastructure instead of centralized services provided by a 
(beneficial) monopolist or a few oligopolists. 
In this talk, we will discuss some necessary ingredients for a Google2Google recipe,
which have to be mixed together to provide a general decentralized search service
infrastructure. Please be aware that a well-tasting solution is still a few years away, and
cooking it all together is not as simple as it might seem at first glance.



2 Distributed Search Engine Peers 

As the first ingredient, we need a large set of distributed Google2Google search peers. 
These are not just distributed crawlers such as the peers in the distributed search engine 
project Grub (http://grub.org/), but rather provide crawling, indexing, ranking and (peer- 
to-peer) query answering and forwarding functionalities. 

Each Google2Google search peer will be responsible for a small partition of the web 
graph, with some overlaps to achieve redundancy, but without centralized schedulers. As 
the web graph is block structured, with inter-block links much sparser than links within 
blocks, search peers have to orient themselves on this structure [4,5]. 

Obviously, it will be important how we connect these peers. A super peer architecture 
[6,7] might be a good choice because it allows us to include task specific indexing and 
route optimization capabilities in these super peers. As for the exact topology, we will 
have a range of options, most probably building upon one of the P2P topologies derived 
from Cayley graphs [8] used in DHT and other P2P networks [9,10,11,12].

Typical search engine users will rely on simple text string searches, though for 
more sophisticated applications we certainly want to provide more sophisticated query 
capabilities as they are available for example in the Edutella network [13]. For certain 
applications we might opt for publish/subscribe infrastructures instead [14], which can
re-use many of the ingredients mentioned in this talk. 



3 Distributed Ranking Algorithms 

One of the main reasons Google quickly replaced older search engines was its ability to 
rank results based on the implicit recommendations of Web page writers expressed in 
the link structure of the web. So we definitely have to throw in a ranking algorithm to
provide Google2Google with comparable functionalities. 

Unfortunately, two main ingredients of Google’s ranking algorithm (TF/IDF and 
PageRank) rely on collection wide document properties (IDF) and central computation 
(PageRank). There is hope, however, that we are able to solve these problems in the 
future: IDF values in many cases do not change much when new documents are added 
[15], and distributed algorithms for computing PageRank and personalized PageRank have
been proposed [16,3]. 

Furthermore, PageRank variants more suited to decentralized computation have
recently been investigated [17,18], and promise to decrease communication costs in a
Google2Google setting. These algorithms compute PageRank-like values in two sepa- 
rate steps, first doing local computation within a site, then computation between sites 
(without taking specific pages into account). Additional analysis is needed on how they 
compare to PageRank and how sites with a lot of non-related pages (e.g. mass hosters 
like Geocities) can be handled. Google2Google search peers will probably have to rely
on blocks within these sites or group the pages of such sites in community-oriented 
clusters. 
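To make the two-step idea concrete, here is a minimal Python sketch, not the
algorithms of [17,18]. It assumes that pages are identified by hypothetical
(site, page) pairs, that a plain power-iteration PageRank is used at both levels,
and that the final score of a page is the product of its site score and its local
score. The function names and the way the two scores are combined are
illustrative assumptions.

```python
from collections import defaultdict


def pagerank(nodes, edges, damping=0.85, iters=50):
    """Plain power-iteration PageRank over a small directed graph."""
    nodes = list(nodes)
    n = len(nodes)
    out = defaultdict(list)
    for src, dst in edges:
        out[src].append(dst)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        nxt = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            # A node without outgoing links spreads its rank evenly.
            targets = out[v] or nodes
            share = damping * rank[v] / len(targets)
            for t in targets:
                nxt[t] += share
        rank = nxt
    return rank


def two_step_rank(links):
    """links: iterable of ((site, page), (site, page)) pairs."""
    links = list(links)
    pages_by_site = defaultdict(set)
    for src, dst in links:
        pages_by_site[src[0]].add(src)
        pages_by_site[dst[0]].add(dst)

    # Step 1: local computation within each site (intra-site links only).
    local = {}
    for site, pages in pages_by_site.items():
        intra = [(s, d) for s, d in links if s[0] == site and d[0] == site]
        local.update(pagerank(sorted(pages), intra))

    # Step 2: computation between sites, ignoring individual pages.
    inter = [(s[0], d[0]) for s, d in links if s[0] != d[0]]
    site_rank = pagerank(sorted(pages_by_site), inter)

    # Assumed combination rule: site score times local score.
    return {page: site_rank[page[0]] * local[page] for page in local}


links = [
    (("A", "a1"), ("A", "a2")), (("A", "a2"), ("A", "a1")),
    (("B", "b1"), ("B", "b2")),
    (("A", "a1"), ("B", "b1")), (("A", "a2"), ("B", "b1")),
    (("C", "c1"), ("B", "b1")),
]
scores = two_step_rank(links)
by_site = defaultdict(float)
for (site, _page), score in scores.items():
    by_site[site] += score
```

In this toy graph only site B receives links from other sites, so its pages
collect the largest share of the total score. Only the site-level step needs
communication between peers; the local step can run entirely at the peer
responsible for a site.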

4 Top-k Retrieval and Optimization 

Obviously, ranking algorithms are not enough; we also have to use them to prune
Google2Google answers to the best k ones. Imagine shipping a hundred thousand
answers or more to the user, who only wants to look at the top ten or top twenty answers
in most cases anyway.

Recent work on top-k query processing and optimization in peer-to-peer networks 
has addressed these issues, and has shown how (meta) data on query statistics and 
ranking methods can be used to retrieve only the k best resources [19,20]. Ranking 
in this context is used to score the results that come from distributed sources, reduce 
the number of answers, enable partial matches to avoid empty result sets and optimize 
query forwarding and answering. Briefly, these algorithms assume a super-peer network 
connected using for example a hypercube-derived topology [12], and implement three 
intertwined functionalities (more details are presented in another presentation [19]): 

- Ranking: Each peer locally ranks its resources with respect to the query and returns 
the local top-k results to its super-peer. 

- Merging: At the super-peers results from the assigned peers are ranked again and 
merged into one top-k list. These answers are returned through the super-peer back- 
bone to the querying peers, with merges at all super-peers involved. 

- Routing: Super-peer indices store information from which directions the top-k an- 
swers were sent for each query. When a known query arrives at a super-peer these 
indices are used to forward the query to the most promising (super-) peers only. A 
small percentage of queries - depending on the volatility of the network - has to be 
forwarded to other peers as well, to update the indices in a query-driven way. 
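The three functionalities can be sketched in a few lines of Python. This is a
simplified illustration under stated assumptions, not the algorithms of [19,20]:
scores are assumed to be comparable across peers, a single super-peer is shown,
and all names (local_top_k, merge_top_k, SuperPeerIndex, explore_rate) are
hypothetical.

```python
import heapq
import random


def local_top_k(scores, k):
    """Ranking: a peer returns its k best (score, resource) pairs."""
    return heapq.nlargest(k, ((s, r) for r, s in scores.items()))


def merge_top_k(lists, k):
    """Merging: a super-peer merges the lists of its peers into one top-k list."""
    best = {}
    for lst in lists:
        for score, res in lst:
            best[res] = max(score, best.get(res, float("-inf")))
    return heapq.nlargest(k, ((s, r) for r, s in best.items()))


class SuperPeerIndex:
    """Routing: remember which peers contributed to the top-k of each query."""

    def __init__(self):
        self.sources = {}

    def record(self, query, per_peer, top):
        winners = {res for _score, res in top}
        self.sources[query] = {
            peer for peer, lst in per_peer.items()
            if any(res in winners for _score, res in lst)
        }

    def route(self, query, all_peers, explore_rate=0.0, rng=random):
        known = self.sources.get(query)
        # Unknown queries are sent to everybody. A small fraction of known
        # queries is also sent to everybody to refresh the index.
        if not known or rng.random() < explore_rate:
            return list(all_peers)
        return sorted(known)


peers = {
    "p1": {"a": 0.9, "b": 0.2, "c": 0.5},
    "p2": {"d": 0.8, "e": 0.1},
    "p3": {"f": 0.3, "g": 0.05},
}
k = 2
per_peer = {p: local_top_k(s, k) for p, s in peers.items()}
top = merge_top_k(list(per_peer.values()), k)
index = SuperPeerIndex()
index.record("q", per_peer, top)
```

After the first evaluation of query "q", the index forwards repeated
occurrences of "q" only to the peers that contributed to its top-k list, while
the explore_rate parameter models the small percentage of queries that are
still forwarded to all peers.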

5 Trust and Security 

Finally, trust and security play an even more important role in our decentralized
Google2Google infrastructure than in more centralized settings. Topics here range from
the issue of decentralized trust establishment and policy-based access control for dis- 
tributed resources [21,22] to distributed computations of trust values meant to achieve 
the same benefits as trust networks in social settings [16,23]. The presentations at the 
1st Workshop on Trust, Security and Reputation on the Semantic Web [24] covered a 
large set of topics in this area and initiated a lot of fruitful discussions. 

In our Google2Google setting, malicious peers could for example try to subvert 
information they provide, an issue especially critical for ranking information. Recent 
analyses of attack scenarios in a distributed PageRank setting [16,3] have been
encouraging, though, and have shown that with suitable additional modifications of the
distributed algorithm, PageRank computation becomes quite resistant to malicious
peers even in a decentralized setting. Personalized PageRank variants can be made even
more resilient if we bias them towards trusted sites / peers [25,26].
6 Conclusion

Even though a decentralized infrastructure makes seemingly simple things surprisingly 
difficult (for example how to implement the “Did you mean xx” functionality in Google,
which relies on (global) query and answer statistics), our short search for available 
ingredients for a Google2Google infrastructure already turned out rather successful. It 
will still take quite a few more research papers (plus an appropriate business / incentive 
model for a distributed search engine infrastructure) until we are able to finalize our
Google2Google recipe. But once we are finished we will have most certainly realized 
quite a few new opportunities for searching and accessing data on the (semantic) web. 
me thank my friend and colleague Karl Aberer for inventing and using the term
Abstract. In 1984, Peter Patel-Schneider published a paper [1] entitled Small can
be Beautiful in Knowledge Representation in which he advocated for limiting the 
expressive power of knowledge representation formalisms in order to guarantee 
good computational properties and thus make knowledge-based systems usable 
as part of larger systems. In this paper, I aim at showing that the same argument 
holds for the Semantic Web: if we want to give the Semantic Web a chance
to scale up and to be broadly used, we have to limit the expressive power of the
ontologies used for the semantic markup of resources. In addition, due to the
scale of the Web and the disparity of its users, it is unavoidable to have to deal 
with distributed heterogeneous ontologies. In this paper, I will argue that a peer-to- 
peer infrastructure enriched with simple distributed ontologies (e.g., taxonomies) 
reconciled through mappings is appropriate and scalable for supporting the future 
Semantic Web. 



1 Introduction 

The Semantic Web [2] envisions a world-wide distributed architecture where data and 
computational resources will easily inter-operate based on semantic marking up of web 
resources using ontologies. Ontologies are a formalization of the semantics of
application domains (e.g., tourism, biology, medicine) through the definition of classes and
relations modeling the domain objects and properties that are considered meaningful
for the application. Building ontologies is a hard and time-consuming task for which
there exist only some very general principles to guide the ontology designers [3,4,5],
who still face many modeling choices. Even for the same domain, different
modeling choices can lead to very different ontologies. In particular, the choice of the
basic classes and relations for modeling the basic domain vocabulary is subject to many 
variations depending on the ontology designers. The appropriate level of detail of the 
ontology descriptions is not easy to determine either and mainly depends on the purpose 
of the ontology construction. 

Several important issues remain open concerning the building and usage of ontolo- 
gies in the setting of the Semantic Web. In this talk, I will discuss some of them and I 
will explain my vision of the kind of infrastructure that I consider as scalable for sup- 
porting the future Semantic Web. An instance of that infrastructure is implemented in 
the Somewhere [6] system that I will present briefly. 
2 Simple Versus Complex Ontologies



A first question concerns the choice of the right expressive power for an adequate
modeling of the ontologies needed in the Semantic Web. The current tendency promoted by
the knowledge representation community is to consider that formal languages with high 
expressive power are required because they enable a fine-grained description of domain 
ontologies. Such a choice is questionable for several reasons. The main argument that
I would raise against this choice is algorithmic: well-known complexity results show
that exploiting ontologies modeled using expressive formal languages (e.g., OWL [7]) 
cannot scale up to complex or large ontologies. Another argument is that formal lan- 
guages with high expressive power are difficult to handle for users having to model an 
application domain. In the Semantic Web setting, the purpose of ontologies is to serve 
as a semantic markup for web resources and as a support for querying in order to obtain 
more efficient and precise search engines. Therefore, they must be simple to be cor- 
rectly understood and rightly exploited by humans (users or experts), and they must be 
expressible in a formal language for which the algorithmic complexity of reasoning is 
reasonable to make them really machine processable. Taxonomies of atomic classes are 
examples of simple ontologies that I envision as good candidates for marking
up resources at the Web scale. They are easy to understand for users and practitioners.
They may even be automatically constructed by data or text mining. They conform
to the W3C recommendation, being a subset of OWL-DL that we could call OWL-PL
since they can be encoded in propositional logic. As a result, query answering becomes 
feasible at a large scale, which is the goal to reach eventually if we want the Semantic 
Web to become a reality. 



3 Personalized and Distributed Versus Standardized Ontologies 



Another question deals with the possibility of building consensual domain ontologies 
that would be broadly shared by users over the Web for marking up their data or resources. 
Building a universal ontology (or using an existing one, e.g., CYC [8]) that could serve 
as a reference set of semantic tags for labelling the data and documents of the Web is
at worst a utopia, at best an enormous enterprise which may eventually turn out to be
useless in practice for the Semantic Web. The current tendency (e.g., [9,10,11]) consists
in building standardized ontologies per domain. So far, it is an open question whether
such an approach is likely to scale up to the Web, because it cannot be taken for granted
that users will adopt those ontologies. The risk is that ontology designers spend
a lot of time building a supposedly consensual ontology which will eventually not be
broadly used because end users will prefer to use their own ontologies. In the same
way as we have our own view of the nesting and the names of the different folders for 
structuring our personal file systems, mail files or bookmarks, it is likely that people 
will prefer using their own ontology to mark up the data or resources they agree to make 
available through the Semantic Web. There is little chance that they will agree to use
an external ontology which they are not used to for re-labelling the resources that they
have already marked up.
However, if the ontologies are simple (e.g., taxonomies of terms) users may be ready
to establish mappings between their own ontology and ontologies of some users with 
whom they share some topics of interest. 

Consider for instance a user Bob, fond of music, who has already stored plenty of
music files in three folders that he has named Classic, Jazz and Rock to distinguish
pieces of classical music, jazz and rock respectively. Suppose now that, while searching for new
music files on the web, he discovers a user Tom making music files available on his web page,
but marked up by the name of the composers. Establishing a mapping saying that
what Tom marks as Mozart (pieces of music) is a specialization of his own view of
Classic (pieces of music) is straightforward for Bob. This mapping will make it possible
for any user querying the web using Bob’s ontology to get Mozart files stored on
Tom’s web page as an answer to his search for classical pieces of music.
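The effect of such a mapping on query answering can be illustrated with a small
Python sketch. It is a centralized toy model, assuming that every class is
written as Peer:Class and that ontologies and mappings only contain inclusion
statements; the class TaxonomyWeb and the file names are hypothetical.

```python
from collections import defaultdict


class TaxonomyWeb:
    """Toy store of inclusion statements and instances, for illustration only."""

    def __init__(self):
        self.subclasses = defaultdict(set)  # class -> direct subclasses
        self.instances = defaultdict(set)   # class -> resources marked with it

    def add_inclusion(self, sub, sup):
        """State that every instance of `sub` is an instance of `sup`."""
        self.subclasses[sup].add(sub)

    def add_instance(self, cls, resource):
        self.instances[cls].add(resource)

    def answer(self, cls):
        """Return the resources of `cls` and of all its (mapped) specializations."""
        seen, stack, result = set(), [cls], set()
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            result |= self.instances[current]
            stack.extend(self.subclasses[current])
        return result


web = TaxonomyWeb()
web.add_instance("Bob:Classic", "bob/bach_suite.mp3")
web.add_instance("Bob:Jazz", "bob/kind_of_blue.mp3")
web.add_instance("Tom:Mozart", "tom/requiem.mp3")
# Bob's mapping: what Tom marks as Mozart is a specialization of Bob's Classic.
web.add_inclusion("Tom:Mozart", "Bob:Classic")
classical = web.answer("Bob:Classic")
```

A query for Bob:Classic now returns Tom's Mozart file together with Bob's own
classical files, while Bob's jazz files are not returned.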

In this vision of the Semantic Web introduced in [12], no user imposes his
own ontology on others, but logical mappings between ontologies make possible the creation of a
web of people in which personalized semantic marking up of data cohabits nicely with 
a collaborative exchange of data. In this view, the Web is a huge Peer Data Management 
System based on simple distributed ontologies and mappings. 



4 Peer Data Management Systems 



Peer data management systems (PDMS) have been proposed recently [13,14,15,16] to 
generalize the centralized approach of information integration systems based on single 
mediators, in which heterogeneous data sources are reconciled through logical mappings 
between each source schema or ontology and a single mediated schema or ontology. A 
centralized vision of mediation is appropriate for building semantic portals dedicated to 
a given domain but is too rigid to scale up to the Semantic Web, for which PDMSs based
on distributed mediation are better suited. In a PDMS, there is no central mediator:
each peer has its own ontology and data or services, and can mediate with some other 
peers to ask and answer queries. The existing PDMSs vary according to (a) the expressive 
power of their underlying data model and (b) the way the different peers are semantically 
connected. Both characteristics have an impact on the allowed queries and their distributed
processing. 

In Edutella [17], each peer locally stores data (educational resources) that are
described in RDF relative to some reference ontologies (e.g., dmoz [9]). For instance,
a peer can declare that it has data corresponding to the concept of the dmoz taxonomy
corresponding to the path Computers/Programming/Languages/Java, and that for such
data it can export the author and the date properties. The overlay network underlying
Edutella is a hypercube of super-peers to which peers are directly connected. Each
super-peer is a mediator over the data of the peers connected to it. When it is queried, its first
task is to check if the query matches its schema: if that is the case, it transmits
the query to the peers connected to it, which are likely to store the data answering the
query; otherwise, it routes the query to some of its neighbour super-peers according to
a strategy exploiting the hypercube topology for guaranteeing a worst-case logarithmic
time for reaching the relevant super-peer.
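The logarithmic bound can be illustrated with classic bit-fixing routing on a
hypercube. The Python sketch below is not Edutella's actual routing protocol: it
assumes a complete hypercube of 2^dim super-peers labelled with dim-bit integers,
in which two super-peers are neighbours when their labels differ in exactly one
bit. The function name hypercube_route is hypothetical.

```python
def hypercube_route(src, dst, dim):
    """Return the list of super-peer labels visited from src to dst."""
    size = 1 << dim
    if not (0 <= src < size and 0 <= dst < size):
        raise ValueError("labels must fit in dim bits")
    path = [src]
    current = src
    for bit in range(dim):
        mask = 1 << bit
        # Fix one differing bit per hop, moving to the neighbour along that axis.
        if (current ^ dst) & mask:
            current ^= mask
            path.append(current)
    return path


path = hypercube_route(0b000, 0b101, 3)
```

Each hop fixes one differing bit, so a route never needs more than dim hops,
that is, the base-2 logarithm of the number of super-peers.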
In contrast with Edutella, Piazza [13,18] does not consider that the data distributed
over the different peers must be described relative to some existing reference schemas.
In Piazza, each peer has its own data and schema and can mediate with some other peers 
by declaring mappings between its schema and the schemas of those peers. The topology 
of the network is not fixed (unlike the hypercube of super-peers in Edutella) but accounts for the
existence of mappings between peers: two peers are logically connected if there exists a 
mapping between their two schemas. The underlying data model considered in the first 
version of Piazza [13] is relational and the mappings between relational peer schemas 
are inclusion or equivalence statements between conjunctive queries. Such a mapping 
formalism encompasses the local-as-views and the global-as-views formalisms used in 
information integration based on centralized single mediators for stating the mappings 
between the schemas of the data sources to be integrated and the mediated schema. 
The price to pay is that query answering is undecidable except if some restrictions 
are imposed on the mappings or on the topology of the network ([13]). The currently 
implemented version of Piazza [18] relies on a tree-based data model: the data is in XML 
and the mappings are equivalence and inclusion statements between XML queries. Query 
answering is implemented based on practical (not complete) algorithms for XML query 
containment and rewritings. The scalability of Piazza so far does not go beyond
about eighty peers in the published experiments and relies on a wide range of
optimizations (mappings composition [19], paths pruning [20]), made possible by the 
centralized storage of all the schemas and mappings in a global server. 

In Somewhere [6], we have made the choice of being fully distributed: there are 
neither super-peers (as in Edutella) nor a central server having the global view of the 
overlay network as in Piazza. In addition, we aim at scaling up to thousands of peers. To
make this possible, we have chosen a simple class-based data model in which the data is a
set of resource identifiers (URIs or URLs), the schemas are (simple) definitions of classes 
possibly constrained by inclusion, disjointness or equivalence statements, and mappings
are inclusion, disjointness or equivalence statements between classes of different peer
schemas. That data model is in accordance with the W3C recommendations since it is
captured by the propositional fragment of the OWL ontology language. 



5 Overview of the Somewhere PDMS 

This section reports on joint work [6] with Philippe Adjiman, Philippe Chatalic, François
Goasdoué and Laurent Simon.

5.1 Data Model 

In Somewhere a new peer joins the network through some peers that it knows (its 
acquaintances) by declaring mappings between its own ontology and the ontologies of 
its acquaintances. Queries are posed to a given peer using its local ontology. The answers 
that are expected are not only instances of local classes but possibly instances of classes 
of peers distant from the queried peer if it can be inferred from the peer ontologies and
the mappings that they satisfy the query. Local ontologies, storage descriptions and 
mappings are defined using a fragment of OWL DL which is the description logics 
fragment of the Web Ontology Language (OWL) recommended by W3C. We call OWL-PL the
fragment of OWL-DL that we consider in Somewhere, where PL stands for propositional
logic. OWL-PL is the fragment of OWL-DL reduced to the union ($\sqcup$), intersection ($\sqcap$)
and complement ($\neg$) constructors for building class descriptions.

Peer ontologies: Each peer ontology is made of a set of axioms of class (partial or com- 
plete) definitions that associate class identifiers with class descriptions, and disjointness, 
equivalence or inclusion statements between class descriptions. Class identifiers are 
unique to each peer: we use the notation P:CI for a class identifier CI of the ontology
of a peer P. 

Peer storage descriptions: The specification of the data that is stored locally in a peer 
P is done through assertional statements relating data identifiers (e.g., URLs or URIs) 
to class identifiers of the ontology of the peer P. The class identifiers of P involved in 
such statements are called the extensional classes of P and their extensions are the sets 
of data identifiers associated with them. 

Mappings: Mappings are disjointness, equivalence or inclusion statements involving 
classes of different peers. They express the semantic correspondence that may exist 
between the ontologies of different peers. 

Schema of a Somewhere peer-to-peer network: In a Somewhere network, the schema is 
not centralized but distributed through the union of the different peer ontologies and the 
mappings. The important point is that each peer has a partial knowledge of the schema: 
it just knows its own data and local ontology, and mappings with its acquaintances. 

Let $\mathcal{P}$ be a Somewhere peer-to-peer network made of a collection of peers
$\{P_i\}_{i \in [1..n]}$. For each peer $P_i$, let $O_i$ and $M_i$ be respectively the local ontology of
$P_i$ and the set of mappings stated at $P_i$ between classes of $O_i$ and classes of the ontologies
of the acquaintances of $P_i$. The schema $S$ of $\mathcal{P}$ is the union $\bigcup_{i \in [1..n]} (O_i \cup M_i)$ of
the ontologies and of the sets of mappings of all the peers composing $\mathcal{P}$.

Semantics: The semantics is a standard logical formal semantics defined in terms of
interpretations. An interpretation $I$ is a pair $(\Delta^I, \cdot^I)$ where $\Delta^I$ is a non-empty set, called
the domain of interpretation, and $\cdot^I$ is an interpretation function which assigns a subset
of $\Delta^I$ to every class identifier and an element of $\Delta^I$ to every data identifier.

An interpretation $I$ is a model of the distributed schema of a Somewhere peer-to-peer
network $\mathcal{P} = \{P_i\}_{i \in [1..n]}$ iff each axiom in $\bigcup_{i \in [1..n]} (O_i \cup M_i)$ is satisfied by $I$.

Interpretations of axioms rely on interpretations of class descriptions, which are
inductively defined as follows:

- $(C_1 \sqcup C_2)^I = C_1^I \cup C_2^I$

- $(C_1 \sqcap C_2)^I = C_1^I \cap C_2^I$

- $(\neg C)^I = \Delta^I \setminus C^I$

Axioms are satisfied if the following holds:

- $C \sqsubseteq D$ is satisfied in $I$ iff $C^I \subseteq D^I$

- $C \equiv D$ is satisfied in $I$ iff $C^I = D^I$

- $C \sqcap D \equiv \bot$ is satisfied in $I$ iff $C^I \cap D^I = \emptyset$

- $C(a)$ is satisfied in $I$ iff $a^I \in C^I$

A Somewhere peer-to-peer network is satisfiable iff its (distributed) schema has a
model.

Given a Somewhere peer-to-peer network $\mathcal{P} = \{P_i\}_{i \in [1..n]}$, a class description $C$
subsumes a class description $D$ iff in each model $I$ of the schema of $\mathcal{P}$, $D^I \subseteq C^I$.
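
The set-based semantics above can be made concrete with a small sketch. The code below is only an illustration under stated assumptions: the class names `Atom`, `Or`, `And`, `Not`, the helper functions and the toy interpretation (`DOMAIN`, `INTERP`, data identifiers `u1` to `u3`) are hypothetical and not part of Somewhere.

```python
from dataclasses import dataclass
from typing import Union


# Class descriptions of the propositional fragment:
# identifiers, union, intersection and complement.
@dataclass(frozen=True)
class Atom:
    name: str


@dataclass(frozen=True)
class Or:
    left: "Desc"
    right: "Desc"


@dataclass(frozen=True)
class And:
    left: "Desc"
    right: "Desc"


@dataclass(frozen=True)
class Not:
    arg: "Desc"


Desc = Union[Atom, Or, And, Not]


def ext(d, domain, interp):
    """Extension of description d under the interpretation (domain, interp)."""
    if isinstance(d, Atom):
        return interp[d.name]
    if isinstance(d, Or):
        return ext(d.left, domain, interp) | ext(d.right, domain, interp)
    if isinstance(d, And):
        return ext(d.left, domain, interp) & ext(d.right, domain, interp)
    if isinstance(d, Not):
        return domain - ext(d.arg, domain, interp)
    raise TypeError(f"not a class description: {d!r}")


def satisfies_inclusion(c, d, domain, interp):
    """C is included in D iff the extension of C is a subset of that of D."""
    return ext(c, domain, interp) <= ext(d, domain, interp)


def satisfies_disjointness(c, d, domain, interp):
    """C and D are disjoint iff their extensions do not intersect."""
    return not (ext(c, domain, interp) & ext(d, domain, interp))


# Hypothetical toy interpretation over three data identifiers.
DOMAIN = {"u1", "u2", "u3"}
INTERP = {
    "S1": {"u1"},
    "S2": {"u2"},
    "R": {"u1", "u2"},
    "G": {"u1", "u2", "u3"},
}
```

Checking a single interpretation only shows that the axioms are satisfied there; subsumption quantifies over all models of the schema, so it cannot be established by inspecting one interpretation.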

5.2 Illustrative Example of the Somewhere Data Model 

Let us consider four persons Ann, Bob, Chris and Dora, each of them bookmarking URLs
about restaurants they know or like. Each has his/her own taxonomy for categorizing
restaurants. In addition, they have to describe their stored available data, i.e., the sets of
URLs that they agree to make available to the PDMS. They do so by declaring some
extensional classes (denoted P:ViewX) as subclasses of some classes (P:X) of their
ontology and by assigning to them the set of corresponding URLs as their instances.

Ann, who works as a restaurant critic, organizes her restaurant URLs according
to the following classes:

- the class Ann:G of restaurants considered as offering "good" cooking, among
which she distinguishes the subclass Ann:R of those which are rated:

Ann:R $\sqsubseteq$ Ann:G

- the class Ann:R is the union of three disjoint classes Ann:S1, Ann:S2, Ann:S3
corresponding respectively to the restaurants rated with 1, 2 or 3 stars:

Ann:R $\equiv$ Ann:S1 $\sqcup$ Ann:S2 $\sqcup$ Ann:S3

Ann:S1 $\sqcap$ Ann:S2 $\equiv \bot$, Ann:S1 $\sqcap$ Ann:S3 $\equiv \bot$, Ann:S2 $\sqcap$ Ann:S3 $\equiv \bot$

- the classes Ann:I and Ann:O, respectively corresponding to Indian and Oriental
restaurants,

- the classes Ann:C, Ann:T and Ann:V, which are subclasses of Ann:O denoting
Chinese, Thai and Vietnamese restaurants respectively:

Ann:C $\sqsubseteq$ Ann:O, Ann:T $\sqsubseteq$ Ann:O, Ann:V $\sqsubseteq$ Ann:O

Suppose that the data that Ann stores and agrees to make available cover
restaurants of various specialties, but that among the rated restaurants she stores and
makes available only those rated with 2 stars. The extensional classes declared by Ann are
then:

Ann:ViewS2 $\sqsubseteq$ Ann:S2, Ann:ViewC $\sqsubseteq$ Ann:C
Ann:ViewV $\sqsubseteq$ Ann:V, Ann:ViewT $\sqsubseteq$ Ann:T
Ann:ViewI $\sqsubseteq$ Ann:I

Bob, who is fond of Asian cooking and likes high quality, organizes his restaurant
URLs according to the following classes:

- the class Bob:A of Asian restaurants,

- the class Bob:Q of restaurants of high quality that he knows.

Suppose that he wants to make available all the data that he has stored. The extensional
classes that he declares are Bob:ViewA and Bob:ViewQ (as subclasses of Bob:A and
Bob:Q):

Bob:ViewA $\sqsubseteq$ Bob:A, Bob:ViewQ $\sqsubseteq$ Bob:Q

Chris is more fond of fish restaurants but recently discovered some places serving
very nice Cantonese cuisine. He organizes his data with respect to the following classes:

- the class Chris:F of fish restaurants,

- the class Chris:CA of Cantonese restaurants.

Suppose that he declares the extensional classes Chris:ViewF and
Chris:ViewCA as subclasses of Chris:F and Chris:CA respectively:

Chris:ViewF $\sqsubseteq$ Chris:F, Chris:ViewCA $\sqsubseteq$ Chris:CA

Finally, Dora organizes her restaurant URLs around the class Dora:DP of her
preferred restaurants, among which she distinguishes the subclass Dora:P of pizzerias
and the subclass Dora:SF of seafood restaurants.

Suppose that the only URLs that she stores concern pizzerias: the only extensional
class that she has to declare is Dora:ViewP, as a subclass of Dora:P:

Dora:ViewP $\sqsubseteq$ Dora:P

Ann, Bob, Chris and Dora are modelled as four peers. Each of them is able to express 
what he/she knows about others using mappings stating properties of class inclusion or 
equivalence. 

Ann: Ann is very confident in Bob’s taste and agrees to include Bob’s selection as good
restaurants by stating Bob:Q $\sqsubseteq$ Ann:G. Finally, she thinks that Bob’s Asian restaurants
encompass her concept of Oriental restaurant: Ann:O $\sqsubseteq$ Bob:A.

Bob: Bob knows that what he calls Asian cooking corresponds exactly to what Ann classifies
as Oriental cooking. This may be expressed using the equivalence statement
Bob:A $\equiv$ Ann:O (note the difference of perception of Bob and Ann regarding the
mappings between Bob:A and Ann:O).

Chris: Chris considers that what he calls fish specialties is a particular case of what
Dora calls seafood specialties: Chris:F $\sqsubseteq$ Dora:SF.

Dora: Dora counts on both Ann and Bob to obtain good Asian restaurants:

Bob:A $\sqcap$ Ann:G $\sqsubseteq$ Dora:DP

Figure 1 describes the peer network induced by the mappings. To lighten
the notation, we omit the local peer name prefix except for the mappings. Edges are
labeled with the class identifiers that are shared through the mappings between peers.
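
Since the figure itself is not reproduced here, the following sketch shows one way the acquaintance graph could be derived from the mappings of this section. It is an illustration only: the data layout (`MAPPINGS`), the function names and the reading of "shared" identifiers as those classes in a mapping that belong to a peer other than the one stating it are assumptions, not the actual Somewhere data structures.

```python
# Mappings of the example, keyed by the peer that states them.
# Each mapping is (left-hand classes, relation, right-hand class);
# several classes on the left stand for their intersection.
MAPPINGS = {
    "Ann": [(["Bob:Q"], "subclass", "Ann:G"), (["Ann:O"], "subclass", "Bob:A")],
    "Bob": [(["Bob:A"], "equivalent", "Ann:O")],
    "Chris": [(["Chris:F"], "subclass", "Dora:SF")],
    "Dora": [(["Bob:A", "Ann:G"], "subclass", "Dora:DP")],
}


def peer_of(class_id):
    """Peer name prefix of a class identifier such as 'Ann:G'."""
    return class_id.split(":", 1)[0]


def acquaintance_edges(mappings):
    """Undirected edges between peers, labelled with shared class identifiers.

    A class identifier is treated as shared (an assumption of this sketch)
    when it appears in a mapping stated by a peer other than its owner.
    """
    edges = {}
    for peer, stated in mappings.items():
        for lhs, _relation, rhs in stated:
            for class_id in lhs + [rhs]:
                other = peer_of(class_id)
                if other != peer:
                    edges.setdefault(frozenset((peer, other)), set()).add(class_id)
    return edges


EDGES = acquaintance_edges(MAPPINGS)
```

Under this reading the network has four edges: Ann and Bob, Chris and Dora, Dora and Bob, Dora and Ann.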

5.3 Query Answering 

Queries and answers: Queries are combinations of classes of a given peer ontology. The
corresponding answer sets are expressed in intension in terms of the combinations of
extensional classes that are rewritings of the query. The point is that extensional classes
of distant peers can participate in the rewritings, and thus their instances in the answer
set of a query posed to a given peer.

Fig. 1. The restaurants PDMS

Given a Somewhere peer-to-peer network $\mathcal{P} = \{P_i\}_{i \in [1..n]}$, a logical combination
$Q_e$ of extensional classes is a rewriting of a query $Q$ iff $Q$ subsumes $Q_e$. $Q_e$ is a maximal
rewriting if there does not exist another rewriting $Q'_e$ of $Q$ subsuming it.

In the Somewhere setting, query rewriting can be equivalently reduced to distributed
reasoning over logical propositional theories by a straightforward propositional encoding
of the distributed ontologies and mappings composing the distributed schema of a
Somewhere peer-to-peer network.

Propositional encoding: The propositional encoding concerns the schema of a Somewhere
peer-to-peer network and the queries. It consists in transforming each query and
schema statement into a propositional formula using class identifiers as propositional
variables. The propositional encoding of a class description $D$, and thus of a query, is
the propositional formula $\mathrm{Prop}(D)$ obtained inductively as follows:

- $\mathrm{Prop}(A) = A$, if $A$ is a class identifier

- $\mathrm{Prop}(D_1 \sqcap D_2) = \mathrm{Prop}(D_1) \wedge \mathrm{Prop}(D_2)$

- $\mathrm{Prop}(D_1 \sqcup D_2) = \mathrm{Prop}(D_1) \vee \mathrm{Prop}(D_2)$

- $\mathrm{Prop}(\neg D) = \neg(\mathrm{Prop}(D))$

The propositional encoding of the schema $S$ of a Somewhere peer-to-peer network
$\mathcal{P}$ is the distributed propositional theory $\mathrm{Prop}(S)$ made of the formulas obtained
inductively from the axioms in $S$ as follows:

- $\mathrm{Prop}(C \sqsubseteq D) = \mathrm{Prop}(C) \Rightarrow \mathrm{Prop}(D)$

- $\mathrm{Prop}(C \equiv D) = \mathrm{Prop}(C) \Leftrightarrow \mathrm{Prop}(D)$

- $\mathrm{Prop}(C \sqcap D \equiv \bot) = \neg\mathrm{Prop}(C) \vee \neg\mathrm{Prop}(D)$

That propositional encoding transfers satisfiability and maps (maximal) conjunctive
rewritings of a query $Q$ to clausal proper (prime) implicates of the propositional formula
$\neg\mathrm{Prop}(Q)$.
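
The inductive definition of the encoding translates almost directly into a recursive function. The sketch below is illustrative only: representing class descriptions as nested tuples, the function names `prop` and `prop_axiom`, and the ASCII connectives (`&`, `|`, `~`, `->`, `<->`) are assumptions made for this example.

```python
def prop(d):
    """Propositional encoding of a class description.

    A description is either a class identifier (a string) or a tuple
    ("and", d1, d2), ("or", d1, d2) or ("not", d).
    """
    if isinstance(d, str):  # class identifier becomes a propositional variable
        return d
    op = d[0]
    if op == "and":
        return f"({prop(d[1])} & {prop(d[2])})"
    if op == "or":
        return f"({prop(d[1])} | {prop(d[2])})"
    if op == "not":
        return f"~{prop(d[1])}"
    raise ValueError(f"unknown constructor: {op!r}")


def prop_axiom(kind, c, d):
    """Propositional encoding of an inclusion, equivalence or disjointness axiom."""
    if kind == "subclass":
        return f"{prop(c)} -> {prop(d)}"
    if kind == "equivalent":
        return f"{prop(c)} <-> {prop(d)}"
    if kind == "disjoint":
        return f"~{prop(c)} | ~{prop(d)}"
    raise ValueError(f"unknown axiom kind: {kind!r}")


# Dora's mapping from the example of Section 5.2.
DORA_MAPPING = prop_axiom("subclass", ("and", "Bob:A", "Ann:G"), "Dora:DP")
```

Here `DORA_MAPPING` is the string `(Bob:A & Ann:G) -> Dora:DP`, which is logically the clause `~Bob:A | ~Ann:G | Dora:DP`.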

Therefore, we can use the message passing algorithm presented in [21] for query 
rewriting in Somewhere. That algorithm is the first consequence finding algorithm in 
a peer-to-peer setting: it is anytime and computes consequences gradually from the 
solicited peer to peers that are more and more distant. 

We illustrate the resulting distributed query processing on the example of Section 5.2.
Consider that a user queries the restaurants PDMS through the Dora peer by asking the
query Dora:DP, meaning that he is interested in getting as answers the set of favourite
restaurants of Dora:

- He will get as a first rewriting Dora:ViewP, corresponding to the extensional class
of the URLs of pizzerias stored locally by Dora.

- Then, the mapping Chris:F $\sqsubseteq$ Dora:SF leads to a new rewriting, Chris:ViewF,
meaning that a way to get restaurants liked by Dora is to obtain the fish restaurants
stored by Chris.

- Finally, the mapping Bob:A $\sqcap$ Ann:G $\sqsubseteq$ Dora:DP leads to the splitting of Bob:A $\sqcap$
Ann:G into the two subqueries Bob:A and Ann:G; they are transmitted respectively
to the peers Bob and Ann, which process them independently:

• Bob:ViewA is a local rewriting of Bob:A, which is transmitted back to the Dora
peer, where it is queued for a future combination with rewritings of the other
subquery Ann:G. In addition, guided by the mapping Ann:O $\equiv$ Bob:A, the Bob
peer transmits to the Ann peer the query Ann:O; the Ann peer processes that
query locally and transmits back to the Bob peer the rewriting Ann:ViewC $\sqcup$
Ann:ViewT $\sqcup$ Ann:ViewV, which in turn is transmitted back to the Dora peer
as an additional rewriting for the subquery Bob:A and queued there.

• Ann:ViewS2 is a local rewriting of Ann:G, which is transmitted back to
the Dora peer, and combined there with the two queued rewritings of Bob:A
(Bob:ViewA and Ann:ViewC $\sqcup$ Ann:ViewT $\sqcup$ Ann:ViewV).
As a result, two rewritings are sent back to the user:

- Ann:ViewS2 $\sqcap$ Bob:ViewA, meaning that a way to obtain restaurants liked
by Dora is to find restaurants that are both stored by Ann as rated with 2 stars
and by Bob as Asian restaurants,

- Ann:ViewS2 $\sqcap$ (Ann:ViewC $\sqcup$ Ann:ViewT $\sqcup$ Ann:ViewV), meaning that another
way to obtain restaurants liked by Dora is to find restaurants stored by
Ann as restaurants rated with 2 stars and also as Chinese, Thai or Vietnamese
restaurants. Note that this rewriting, which is obtained after several splitting and
re-combining steps, turns out to be composed of extensional classes of the same peer
(Ann).

• Because of the mapping Bob:Q $\sqsubseteq$ Ann:G, Ann transmits the query Bob:Q to
Bob, which transmits back to Ann Bob:ViewQ as a rewriting of Bob:Q and
then of Ann:G. Ann then transmits Bob:ViewQ back to Dora as a rewriting of
Ann:G. At Dora’s side, Bob:ViewQ is now combined with the queued rewritings
of Bob:A (Bob:ViewA and Ann:ViewC $\sqcup$ Ann:ViewT $\sqcup$ Ann:ViewV).
As a result, two new rewritings are sent back to the user:

- Bob:ViewQ $\sqcap$ Bob:ViewA, meaning that to obtain restaurants liked by Dora
you can take the restaurants that Bob stores as high quality restaurants and also
as Asian restaurants,

- Bob:ViewQ $\sqcap$ (Ann:ViewC $\sqcup$ Ann:ViewT $\sqcup$ Ann:ViewV), providing a
new way of getting restaurants liked by Dora (restaurants that are both considered
as high quality restaurants by Bob and stored as Chinese, Thai or Vietnamese
restaurants).
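
The rewritings listed above can be double-checked with a small centralized sketch. This is not the distributed message passing algorithm of [21]: it is plain forward chaining over a hand-picked fragment of the example schema, which works here only because every axiom in that fragment is an implication with a single class on the right-hand side. The names `RULES`, `closure` and `is_rewriting` are hypothetical.

```python
# Fragment of the example schema as implications (premises -> conclusion).
# Premises are conjoined; only one direction of each equivalence is kept.
RULES = [
    ({"Ann:ViewS2"}, "Ann:S2"),
    ({"Ann:S2"}, "Ann:R"),          # from Ann:R = S1 or S2 or S3
    ({"Ann:R"}, "Ann:G"),
    ({"Ann:ViewC"}, "Ann:C"),
    ({"Ann:ViewT"}, "Ann:T"),
    ({"Ann:ViewV"}, "Ann:V"),
    ({"Ann:C"}, "Ann:O"),
    ({"Ann:T"}, "Ann:O"),
    ({"Ann:V"}, "Ann:O"),
    ({"Ann:O"}, "Bob:A"),           # from Bob:A = Ann:O
    ({"Bob:ViewA"}, "Bob:A"),
    ({"Bob:ViewQ"}, "Bob:Q"),
    ({"Bob:Q"}, "Ann:G"),
    ({"Chris:ViewF"}, "Chris:F"),
    ({"Chris:F"}, "Dora:SF"),
    ({"Dora:ViewP"}, "Dora:P"),
    ({"Dora:P"}, "Dora:DP"),
    ({"Dora:SF"}, "Dora:DP"),
    ({"Bob:A", "Ann:G"}, "Dora:DP"),
]


def closure(classes, rules):
    """All class identifiers entailed by the given ones (forward chaining)."""
    known = set(classes)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known


def is_rewriting(extensional_classes, query, rules=RULES):
    """True if the intersection of the given classes is subsumed by the query."""
    return query in closure(extensional_classes, rules)
```

A rewriting containing a union, such as Ann:ViewS2 intersected with the union of Ann:ViewC, Ann:ViewT and Ann:ViewV, can be checked by testing each disjunct separately. A negative answer from `is_rewriting` only means the subsumption is not derivable from this fragment.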

A peer-to-peer architecture implementing this distributed query rewriting algorithm
has been developed and the first experimental results on its scalability are promising [6].
This architecture is used in a joint project with France Telecom, which aims at enriching
peer-to-peer web applications with semantics in the form of taxonomies (e.g., Someone
[12]).

6 Conclusion 

Most of the concepts, tools and techniques deployed so far by the Semantic Web com- 
munity correspond to the "big is beautiful" idea that high expressivity is needed for 
describing domain ontologies. As a result, when they are applied, the so-called Seman- 
tic Web technologies are mostly used for building thematic portals but do not scale up 
to the Web. 

In this paper, I have argued in favour of a "simple-is-beautiful" vision of the Semantic
Web consisting in progressing step by step from the current web towards a more semantic
web. The first challenging step (which is far from being reached) should be to do better than
Google for searching through the whole Web. My vision of a "Semantic Google" would
be to replace the use of words for annotating web documents by terms of a taxonomy.
Though terms of a taxonomy are words, the (big) difference is that the taxonomy provides
a kind of context of interpretation for those terms which is most of the time sufficient
in practice to disambiguate their meaning. Therefore, it is important that taxonomies
whose terms are used for annotating web resources are attached to those web resources.
In this vision, any user is allowed to freely annotate web resources with terms of the
taxonomies of his choice, but he must attach those taxonomies to the web resources he
has annotated. The glue of such a semantic web would be provided by mappings between
taxonomies, and the infrastructure implementing it would be a peer-to-peer one.



Abstract. This paper describes a method for converting existing thesauri
and related resources from their native format to RDF(S) and
OWL. The method identifies four steps in the conversion process. In
each step, decisions have to be taken with respect to the syntax or semantics
of the resulting representation. Each step is supported through
a number of guidelines. The method is illustrated through conversions of
two large thesauri: MeSH and WordNet.



1 Introduction 

Thesauri are controlled vocabularies of terms in a particular domain with hi- 
erarchical, associative and equivalence relations between terms. Thesauri such 
as NLM’s Medical Subject Headings (MeSH) are mainly used for indexing
and retrieval of articles in large databases (in the case of MeSH the MEDLINE/PubMed
database containing over 14 million citations 1 ). Other resources,
such as the lexical database WordNet, have been used as background knowl- 
edge in several analysis and semantic integration tasks [2]. However, their native 
format, often a proprietary XML, ASCII or relational schema, is not compati- 
ble with the Semantic Web’s standard format, RDF(S). This paper describes a 
method for converting thesauri to RDF / OWL and illustrates it with conversions 
of MeSH and WordNet. 

The main objective of converting existing resources to the RDF data model 
is that these can then be used in Semantic Web applications for annotations. 
Thesauri provide a hierarchically structured set of terms about which a commu- 
nity has reached consensus. This is precisely the type of background knowledge 
required in Semantic Web applications. One insight from the submissions to the 
Semantic Web challenge at ISWC’03 2 was that these applications typically used 
simple thesauri instead of complex ontologies. 

Although conversions of thesauri have been performed, currently no accepted
methodology exists to support these efforts. This paper presents a method that
can serve as the starting point for such a methodology. The method and guidelines
are based on the authors’ experience in converting various thesauri. This
paper is organized as follows. Section 2 provides introductory information on
thesauri and their structure. In Sect. 3 we describe our method and the rationale
behind its steps and guidelines. Sections 4 and 5 each discuss a case study
in which the conversion method is applied to MeSH and WordNet, respectively.
Additional guidelines that were developed during the case studies, or are more
conveniently explained with a specific example, are introduced in these sections.
Related research can be found in Sect. 6. Finally, Sect. 7 offers a discussion.

1 http://www.ncbi.nlm.nih.gov/entrez/

2 http://www-agki.tzi.de/swc/swc2003submissions.html

S.A. McIlraith et al. (Eds.): ISWC 2004, LNCS 3298, pp. 17-31, 2004.

(c) Springer-Verlag Berlin Heidelberg 2004



2 Structure of Thesauri 

Many thesauri are historically based on the ISO 2788 and ANSI/NISO Z39.19 
standards [3,1]. The main structuring concepts are terms and three relations 
between terms: Broader Term (BT), Narrower Term (NT) and Related Term 
(RT). Preferred terms should be used for indexing, while non-preferred terms 
are included for use in searching. Preferred terms (also known as descriptors) 
are related to non-preferred terms with Use For (UF); USE is the inverse of this 
relation. Only preferred terms are allowed to have BT, NT and RT relations. The 
Scope Note (SN) relation is used to provide a definition of a term (see Fig. 1). 



Fig. 1. The basic thesaurus relations. Scope note is not shown.



Two other constructs are qualifiers and node labels. Homonymous terms should 
be supplemented with a qualifier to distinguish them, for example “BEAMS 
(radiation)” and “BEAMS (structures)”. A node label is a term that is not 
meant for indexing, but for structuring the hierarchy, for example “KNIVES By 
Form”. Node labels are also used for organizing the hierarchy in either fields 
or facets. The former divides terms into areas of interest such as “injuries” and 
“diseases”, the latter into more abstract categories such as “living” and “non- 
living” [3]. 

The standards advocate a term-based approach, in which terms are related 
directly to one another. In the concept-based approach [7], concepts are inter- 
related, while a term is only related to the concept for which it stands; i.e. a 
lexicalization of a concept [12]. The concept-based approach may have advantages
such as improved clarity and easier maintenance [6].



3 Method Description

The method is divided into four steps: (0) a preparatory step, (1) a syntactic 
conversion step, (2) a semantic conversion step, and (3) a standardization step. 
The division of the method into four steps is an extension of previous work [15]. 



3.1 Step 0: Preparation 

To perform this step (and therefore also the subsequent steps) correctly, it is 
essential to contact the original thesaurus authors when the documentation is 
unclear or ambiguous. An analysis of the thesaurus contains the following: 

— Conceptual model (the model behind the thesaurus is used as background 
knowledge in creating a sanctioned conversion); 

— Relation between conceptual and digital model; 

— Relation to standards (aids in understanding the conceptual and digital 
model); 

— Identification of multilinguality issues. 

Although we recognize that multilinguality is an important and complicating 
factor in thesaurus conversion (see also [8]), it is not treated in this paper. 



3.2 Step 1: Syntactic Conversion 

In this step the emphasis lies on the syntactic aspects of the conversion pro- 
cess from the source representation to RDF(S). Typical source representations 
are (1) a proprietary text format, (2) a relational database and (3) an XML 
representation. This step can be further divided into two substeps. 



Step 1a: structure-preserving translation. In Step 1a, a structure-preserving
translation between the source format and RDF format is performed,
meaning that the translation should reflect the source structure as closely as possible.
The translation should be complete, meaning that all semantically relevant
elements in the source are translated into RDF.

Guideline 1: Use a basic set of RDF(S) constructs for the structure-preserving
translation. Only use constructs for defining classes, subclasses,
properties (with domains and ranges), human-readable rdfs:labels
for class and property names, and XML datatypes. These are the basic building
blocks for defining an RDF representation of the conceptual model. The
remaining RDF(S) and OWL constructs are used in Step 2 for a semantically
oriented conversion. However, one might argue that the application of some
constructs (e.g. domains and ranges) also belongs to semantic conversion.

Guideline 2: Use XML support for datatyping. Simple built-in XML Schema
datatypes such as xsd:date and xsd:integer are useful to supply schemas
with information on property ranges. Using user-defined XML Schema
datatypes is still problematic 3; hopefully this problem will be solved in the
near future.

Guideline 3: Preserve original naming as much as possible. Preserving the
original naming of entities results in clearer and more traceable conversions.
Prefix duplicate property names with the name of the source entity to make
them unique. The meaning of a class or property can be explicated by
adding an rdfs:comment, preferably containing a definition from the original
documentation. If documentation is available online, rdfs:seeAlso or
rdfs:isDefinedBy statements can be used to link to the original documentation
and/or definition.

Guideline 4: Translate relations of arity three or more into structures 
with blank nodes. Relations of arity three or more cannot be translated 
directly into RDF properties. If the relation’s arguments are independent 
of each other, a structure can be used consisting of a property (with the 
same name as the original relation) linking the source entity to a blank node 
(representing the relation), and the relation’s arguments linked to the blank 
node with an additional property per argument (see examples in Sect. 4). 

Guideline 5: Do not translate semantically irrelevant ordering infor- 
mation. Source representations often contain sequential information, e.g. 
ordering of a list of terms. These may be irrelevant from a semantic point of 
view, in which case they can be left out of the conversion. 

Guideline 6: Avoid redundant information. Redundant information creates 
representations which are less clear and harder to maintain. An example on 
how to avoid this: if the Unique Identifier (UI) of a resource is recorded in 
the rdf : ID, then do not include a property that also records the UI. 

Guideline 7: Avoid interpretation. Interpretations of the meaning of information
in the original source (i.e., meaning that cannot be traced back to
the original source or documentation) should be approached with caution, as
wrong interpretations result in inconsistent and/or inaccurate conversions.
The approach of this method is to postpone interpretation (see Step 2b).

Instead of developing a new schema (i.e., thesaurus metamodel), one can also use
an existing thesaurus schema, such as SKOS (see Sect. 3.4), which already
defines “Concept”, “broader”, etc. This may be a simpler approach than to
first develop a new schema and later map this onto SKOS. However, this
is only a valid approach if the metamodel of the source and of SKOS match.
A drawback is that the naming of the original metamodel is lost (e.g. “broader”
is used instead of the original “BT”). For thesauri with a (slightly) different metamodel, it is
recommended to develop a schema from scratch, so as not to lose the original
semantics, and map this schema onto SKOS in Step 3.

3 http://www.w3.org/2001/sw/WebOnt/webont-issues.html#I4.3-Structured-Datatypes

Step 1b: explication of syntax. Step 1b concerns the explication of information
that is implicit in the source format, but intended by the conceptual model.
The same set of RDF(S) constructs is used as in Step 1a. For example, the AAT
thesaurus [11] uses node labels (called “Guide Terms” in AAT), but in the AAT
source data these are only distinguished from normal terms by enclosing the
term name in angle brackets (e.g. <KNIVES by Form>). This information can
be made explicit by creating a class GuideTerm, which is an rdfs:subClassOf
the class AATTerm, and assigning this class to all terms with angle brackets.
Other examples are described in Sects. 4 and 5.
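
The angle-bracket convention can be made explicit with a few lines of code. The following is a minimal sketch, not the authors' conversion program: the `aat:` prefix, the function name and the use of bare term names as subjects (a real conversion would mint URIs) are assumptions made for illustration.

```python
import re

# A guide term is a term name fully enclosed in angle brackets.
GUIDE_TERM = re.compile(r"^<(.+)>$")


def classify_terms(names):
    """Return (subject, predicate, object) triples typing each term name."""
    triples = [("aat:GuideTerm", "rdfs:subClassOf", "aat:AATTerm")]
    for name in names:
        match = GUIDE_TERM.match(name)
        if match:
            # Strip the brackets: the class membership now carries that information.
            triples.append((match.group(1), "rdf:type", "aat:GuideTerm"))
        else:
            triples.append((name, "rdf:type", "aat:AATTerm"))
    return triples


TRIPLES = classify_terms(["<KNIVES by Form>", "KNIVES"])
```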

3.3 Step 2: Semantic Conversion 

In this step the class and property definitions are augmented with additional 
RDFS and OWL constraints. Its two substeps are aimed at explication (Step 
2a) and interpretation (Step 2b). After completion of Step 2a the thesaurus is 
ready for publication on the Web as an “as-is” RDF/OWL representation. 

Step 2a: explication of semantics. This step is similar to Step 1b, but
now more expressive RDFS and OWL constructs may be used. For example, a
broaderTerm property can be defined as an owl:TransitiveProperty and a
relatedTerm property as an owl:SymmetricProperty.

A technique that is used in this step is to define certain properties as specializations
of predefined RDFS properties, e.g. rdfs:label and rdfs:comment. For
example, if a property nameOf is clearly intended to denote a human-readable
label for a resource, it makes sense to define this property as a subproperty
of rdfs:label. RDFS-aware tools will now be able to interpret nameOf in the
intended way.
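
To see what declaring a property transitive or symmetric buys, it helps to compute the statements an OWL-aware tool may infer. The sketch below does this over plain Python pairs; the term names are made up for illustration and no OWL reasoner is involved.

```python
def transitive_closure(pairs):
    """Smallest transitive relation containing the given (narrower, broader) pairs."""
    closure = set(pairs)
    while True:
        inferred = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
        if inferred <= closure:
            return closure
        closure |= inferred


def symmetric_closure(pairs):
    """Smallest symmetric relation containing the given pairs."""
    return set(pairs) | {(b, a) for (a, b) in pairs}


# Hypothetical broaderTerm and relatedTerm statements.
BROADER = {("paring knives", "knives"), ("knives", "cutting tools")}
RELATED = {("knives", "forks")}

ALL_BROADER = transitive_closure(BROADER)
ALL_RELATED = symmetric_closure(RELATED)
```

With broaderTerm declared transitive, "paring knives" is also narrower than "cutting tools"; with relatedTerm declared symmetric, "forks" is related to "knives" without a second explicit statement.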

Step 2b: interpretations. In Step 2b specific interpretations are introduced
that are strictly speaking not sanctioned by the original model or documentation.
A common motivation is some application-specific requirement, e.g. an
application wants to treat a broaderTerm hierarchy as a class hierarchy. This
can be stated as follows: broaderTerm rdfs:subPropertyOf rdfs:subClassOf.
Semantic Web applications using thesauri will often want to do this, even if not
all hierarchical links satisfy the subclass criteria. This introduces the notion of
metamodeling. It is not surprising that the schema of a thesaurus is typically a
metamodel: its instances are categories for describing some domain of interest.

Guideline 8: Consider treating the thesaurus schema as a metamodel. The
instances of a thesaurus schema are often general terms or concepts that occur
as classes in other places. RDFS allows one to treat instances as classes:
simply add the statement that the class of those instances is a subclass of
rdfs:Class. For example, an instance i is of class C; class C is declared
to be an rdfs:subClassOf rdfs:Class. Because instance i is now also an
instance of rdfs:Class, it can be treated as a class.

The above example of treating broader term as a subclass relation is similar
in nature.

A schema which uses these constructions is outside the scope of OWL DL.
Application developers will have to make their own expressivity vs. tractability
trade-off here.

The output of this step should be used in applications as a specific interpretation
of the thesaurus, not as a standard conversion.
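
The two metamodeling statements discussed in Step 2b can be traced through the standard RDFS entailment rules for rdfs:subPropertyOf (rule rdfs7) and rdfs:subClassOf (rule rdfs9). The sketch below applies just these two rules to a handful of triples; the `ex:` names are hypothetical and the code is an illustration, not a complete RDFS reasoner.

```python
def rdfs_closure(triples):
    """Apply RDFS rules rdfs7 and rdfs9 until no new triple is derived."""
    graph = set(triples)
    while True:
        inferred = set()
        for (s, p, o) in graph:
            # rdfs7: (p subPropertyOf q) and (s p o) entail (s q o)
            for (p1, rel, q) in graph:
                if rel == "rdfs:subPropertyOf" and p1 == p:
                    inferred.add((s, q, o))
            # rdfs9: (C subClassOf D) and (s type C) entail (s type D)
            if p == "rdf:type":
                for (c, rel, d) in graph:
                    if rel == "rdfs:subClassOf" and c == o:
                        inferred.add((s, "rdf:type", d))
        if inferred <= graph:
            return graph
        graph |= inferred


# Hypothetical thesaurus schema and two of its instances.
GRAPH = rdfs_closure({
    ("ex:Concept", "rdfs:subClassOf", "rdfs:Class"),
    ("ex:broaderTerm", "rdfs:subPropertyOf", "rdfs:subClassOf"),
    ("ex:knives", "rdf:type", "ex:Concept"),
    ("ex:paringKnives", "ex:broaderTerm", "ex:knives"),
})
```

After closure, the instance ex:knives is also an rdfs:Class, and the broaderTerm link has become an rdfs:subClassOf link, which is exactly the application-specific interpretation described above.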



3.4 Step 3: Standardization 

Several proposals exist for a standard schema for thesauri. 4 Such a schema may 
enable the development of infrastructure that can interpret and interchange 
thesaurus data. Therefore, it may be useful to map a thesaurus onto a stan- 
dard schema. This optional step can be made both after Step 2a (the result 
may be published on the web) and Step 2b (the result may only be used in an 
application-specific context). Unfortunately, a standard schema has not yet been 
agreed upon. As illustration, the schema of the W3C Semantic Web Advanced 
Development for Europe (SWAD-E) project 5 is mapped to MeSH in Sect. 4. 

The so-called “SKOS” schema of SWAD-E is concept-based, with class 
Concept and relations narrower, broader and related between Concepts. A 
Concept can have a prefLabel (preferred term) and altLabels (non-preferred 
terms). Also provided is a TopConcept class, which can be used to arrange a 
hierarchy under special concepts (such as fields and facets, see Sect. 2). TopConcept
is a subclass of Concept. Note that because SKOS is concept-based, it may
be problematic to map term-based thesauri.



4 Case One: MeSH 

This section describes how the method has been applied to MeSH (version 
2004 6 ). The main source consists of two XML files: one containing so-called 
descriptors (228 MB), and one containing qualifiers (449 Kb). Each has an as- 
sociated DTD. A file describing additional information on descriptors was not 
converted. The conversion program (written in XSLT) plus links to the original 
source and output files of each step can be found at http://thesauri.cs.vu.nl/. 
The conversion took two persons approximately three weeks to complete. 

4 http://www.w3.org/2001/sw/Europe/reports/thes/thes_links.html

5 http://www.w3.org/2001/sw/Europe/reports/thes/1.0/guide/

6 http://www.nlm.nih.gov/mesh/filelist.html



4.1 Analysis of MeSH

The conceptual model of MeSH is centered around Descriptors, which contain 
Concepts [9]. In turn, Concepts consist of a set of Terms. Exactly one Concept 
is the preferred Concept of a Descriptor, and exactly one Term is the preferred 
Term of a Concept. Each Descriptor can have Qualifiers, which are used to 
indicate aspects of a particular Descriptor, e.g. “ABDOMEN” has the Qualifiers 
“pathology” and “abnormalities”. Descriptors are related in a polyhierarchy, 
and are meant to represent broader/narrower document retrieval sets (i.e., not a 
subclass relation). Each Descriptor belongs to one (or more) of fifteen Categories, 
such as “Anatomy” and “Diseases” [10]. The Concepts contained within one 
Descriptor are also hierarchically related to each other. 

This model is inconsistent with the ISO and ANSI standards, for several 
reasons. Firstly, the model is concept-based. Secondly, Descriptors contain a set 
of Concepts, while in the standards a Descriptor is simply a preferred term. 
Thirdly, Qualifiers are not used to disambiguate homonyms. 

4.2 Converting MeSH 

Step 1a: structure-preserving translation. In the XML version of MeSH,
Descriptors, Concepts, Terms and Qualifiers each have a Unique Identifier (UI).
Each Descriptor also has a TreeNumber (see [1]). This is used to indicate a
position in a polyhierarchical structure (a Descriptor can have more than one
TreeNumber), but this is implicit only. Relations between XML elements are
made by referring to the UI of the relation’s target (e.g. <SeeRelatedDescriptor>
contains the UI of another Descriptor). In Step 1a, this is converted into instances
of the property hasRelatedDescriptor. The explication of TreeNumbers
is postponed until Step 1b.
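
As a rough illustration of what "implicit only" means here, the sketch below derives hierarchical links from tree numbers. It assumes that a tree number is a dot-separated path whose parent position is obtained by dropping the last segment, which is how MeSH tree numbers are commonly read but is not spelled out in the text above. The descriptor identifiers and tree numbers are made up, and the function names are hypothetical.

```python
def parent_tree_number(tree_number):
    """Tree number of the parent position, or None at the top of a tree."""
    head, separator, _last = tree_number.rpartition(".")
    return head if separator else None


def broader_links(tree_numbers):
    """Derive (descriptor, broader descriptor) pairs.

    tree_numbers maps a descriptor identifier to its list of tree numbers;
    a descriptor with several tree numbers may get several parents.
    """
    owner = {tn: d for d, tns in tree_numbers.items() for tn in tns}
    links = set()
    for descriptor, tns in tree_numbers.items():
        for tn in tns:
            parent = parent_tree_number(tn)
            if parent in owner:
                links.add((descriptor, owner[parent]))
    return links


# Made-up descriptors; D3 sits at two positions in the hierarchy.
LINKS = broader_links({
    "D1": ["A01"],
    "D2": ["A01.047"],
    "D3": ["A01.047.365", "A01.500"],
})
```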

Most decisions in Step la concern which XML elements should be translated 
into classes, and which into properties. The choice to create classes for Descrip- 
tor, Concept and Term are clear-cut: these are complex, interrelated structures. 
A so-called <EntryCombination> relates a Descriptor-Qualifier pair to another 
Descriptor-Qualifier pair. Following Guideline 4, two blank nodes are created 
(each representing one pair) and related to an instance of the class EntryCombi- 
nation. As already mentioned, relations between elements in XML MeSH are 
made by referring to the UI of the target. However, each such relation also in- 
cludes the name of the target. As this is redundant information, the name can 
be safely disregarded. 
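
As a minimal sketch of this structure-preserving step, the Python fragment below turns a related-Descriptor reference into an instance of hasRelatedDescriptor. The XML fragment is a simplified, hypothetical stand-in for a MeSH record (real records carry many more elements), and the mesh: prefix and the tuple representation of triples are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical fragment; not an actual MeSH record.
SAMPLE = """
<DescriptorRecord>
  <DescriptorUI>D000001</DescriptorUI>
  <DescriptorName>Example A</DescriptorName>
  <SeeRelatedDescriptor>
    <DescriptorUI>D000002</DescriptorUI>
    <DescriptorName>Example B</DescriptorName>
  </SeeRelatedDescriptor>
</DescriptorRecord>
"""

def translate(xml_text):
    """Return (subject, property, object) triples for one Descriptor record."""
    record = ET.fromstring(xml_text)
    # findtext with a bare tag name only looks at direct children,
    # so the nested DescriptorUI of the related Descriptor is not picked up here.
    subject = "mesh:" + record.findtext("DescriptorUI")
    triples = [
        (subject, "rdf:type", "mesh:Descriptor"),
        (subject, "mesh:descriptorName", record.findtext("DescriptorName")),
    ]
    for related in record.findall("SeeRelatedDescriptor"):
        # Only the UI of the target is kept; the repeated target name is redundant.
        target = "mesh:" + related.findtext("DescriptorUI")
        triples.append((subject, "mesh:hasRelatedDescriptor", target))
    return triples
```

Calling translate(SAMPLE) yields three triples; the name of the related Descriptor does not appear in the output because it can be recovered from the target's own record.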

Guideline 9: Give preference to the relation-as-arc approach over the
relation-as-node approach. In the relation-as-arc approach, relations are
modeled as arcs between entities (RDF uses “properties” to model arcs). 
In the relation-as-node approach, a node represents the relation, with one 
arc relating the source entity to the relation node, and one arc relating the 
relation node to the destination entity [7]. The relation-as-arc approach is 
more natural to the RDF model, and also allows for definition of property 
semantics (symmetry, inverseness, etc.) in OWL.

It is not always possible to follow Guideline 9, e.g. in the case of MeSH
<ConceptRelation>. Although the straightforward choice is to create a prop- 
erty that relates two Concepts, this is not possible as each ConceptRelation 
has an additional attribute (see Guideline 4). Again, a blank node is used (the 
relation-as-node approach).
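
The difference between the two approaches of Guideline 9 can be illustrated with a small Python sketch. Triples are written as plain tuples, and the property names ex:hasRelation and ex:relatedTo as well as the blank node label are hypothetical; they are not part of the MeSH conversion itself.

```python
def relation_as_arc(source, prop, target):
    """One property instance (arc) directly relates source and target."""
    return [(source, prop, target)]

def relation_as_node(source, relation_class, target, node, attributes=None):
    """A node stands for the relation: one arc leads from the source to the
    node and one from the node to the target. Extra attributes of the
    relation can then be attached to the node."""
    triples = [
        (node, "rdf:type", relation_class),
        (source, "ex:hasRelation", node),
        (node, "ex:relatedTo", target),
    ]
    for name, value in (attributes or {}).items():
        triples.append((node, name, value))
    return triples
```

The arc form needs a single triple but leaves no place for an additional attribute, which is why a relation carrying one falls back to the node form.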

Guideline 10: Create proxy classes for references to external resources 
if they are not available in RDF. Each Concept has an associated SemanticType,
which originates in the UMLS Semantic Network⁷. This external
resource is not available in RDF, but might be converted in the near fu- 
ture. In MeSH, only the UI and name of the SemanticType is recorded. One 
could either use a datatype property to relate the UI to a Concept (again, 
the redundant name is ignored), or create SemanticType instances (empty 
proxies for the actual types). We have opted for the latter, as this simplifies 
future integration with UMLS. In this scenario, either new properties can be 
added to the proxies, or the existing proxies can be declared owl:sameAs to
SemanticType instances of a converted UMLS. 



Guideline 11: Only create rdf:IDs based on identifiers in the original
source. A practical problem in the syntactical translation is what value
to assign to the rdf:ID attribute. If the original source does not provide a
unique identifier for an entity, one should translate it into a blank node, as
opposed to generating a new identifier. A related point is that if the UI is
recorded using rdf:ID, additional properties to record an entity’s UI would
introduce redundancy, and therefore should not be used.
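
A possible reading of Guideline 11 in code, as a sketch only: entities that carry a source identifier are named after it, and all others receive a document-local blank node label. The mesh: prefix and the _:bN labels are assumptions for illustration.

```python
import itertools

def make_namer(prefix="mesh:"):
    """Return a function that names an entity from its source identifier,
    or hands out a fresh blank node label when the source has none."""
    counter = itertools.count(1)

    def name(ui=None):
        if ui:
            # The identifier comes from the original source; nothing is invented.
            return prefix + ui
        # Blank node labels are local to the document, not new global identifiers.
        return "_:b%d" % next(counter)

    return name
```

For example, with name = make_namer(), the call name("D000001") reuses the UI, while successive calls to name() give distinct blank node labels.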

Guideline 12: Use the simplest solution that preserves the intended se- 
mantics. In XML MeSH, only one Term linked to a Concept is the preferred 
term. Some terms are permutations of the term name (indicated with the 
attribute isPermutedTermYN), but unfortunately have the same UI as the 
Term from which they are generated. A separate instance cannot be created 
for this permuted term, as this would introduce a duplicate rdf:ID. Two
obvious solutions remain: create a blank node or relate the permuted term 
with a datatype property permutedTerm to Term. In the first solution, one 
would also need to relate the node to its non-permuted parent, and copy 
all information present in the parent term to the permuted term node (thus 
introducing redundancy). The second solution is simpler and preserves the 
intended semantics. 



Step 1b: explication of syntax. In Step 1b, three explications are made.
Firstly, the TreeNumbers are used to create a hierarchy of Descriptors with
a [Descriptor] subTreeOf [Descriptor] property. Secondly, the TreeNumber
starts with a capital letter which stands for one of fifteen Categories. The
class Category and property [Descriptor] inCategory [Category] are introduced
to relate Descriptors to their Categories. Thirdly, the ConceptRelations
are translated into three properties, brd, nrw and rel, thus converting from a
relation-as-node to a relation-as-arc approach (see Guidelines 12 and 9). This
requires two observations: (a) the values NRW, BRD and REL of the attribute
relationName correspond to narrower, broader and related Concepts; and (b)
the relationAttribute is not used in the actual XML, and can be removed.
Without the removal of the relationAttribute, the arity of the relation would
have prevented us from using object properties.

⁷ http://www.nlm.nih.gov/pubs/factsheets/umlssemn.html

Some elements are not explicated, although they are clear candidates. These 
are XML elements which contain text, but also implicit information that can be 
used to link instances. For example, a Descriptor’s <RelatedRegistryNumber> 
contains the ID of another Descriptor, but also other textual information. Split- 
ting this information into two or more properties changes the original semantics, 
so we have chosen to create a datatype property for this element and copy the 
text as a literal value. 
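
The first two explications of Step 1b can be sketched as follows. The sketch assumes that a TreeNumber is a dot-separated path whose prefix (everything before the last dot) is the TreeNumber of the parent, and that its first character names the Category. The Descriptor identifiers and TreeNumbers in the usage note are made up for illustration.

```python
def parent_tree_number(tree_number):
    """Return the TreeNumber one level up, or None at the top of a tree."""
    head, sep, _ = tree_number.rpartition(".")
    return head if sep else None

def explicate(tree_numbers):
    """tree_numbers maps a Descriptor identifier to its list of TreeNumbers.
    Returns subTreeOf and inCategory triples made explicit from them."""
    by_number = {tn: ui for ui, tns in tree_numbers.items() for tn in tns}
    triples = set()
    for ui, tns in tree_numbers.items():
        for tn in tns:
            # The leading capital letter stands for the Category.
            triples.add((ui, "inCategory", tn[0]))
            parent = parent_tree_number(tn)
            if parent in by_number:
                triples.add((ui, "subTreeOf", by_number[parent]))
    return triples
```

With the hypothetical input {"D1": ["A01"], "D2": ["A01.236"], "D3": ["A01.236.500", "C02.100"]}, the Descriptor D3 ends up in two Categories (A and C) and below D2, which reflects the polyhierarchy.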

Step 2a: explication of semantics. In Step 2a, the following statements are 
added (a selection): 

— The properties brd and nrw are each other’s inverse, and are both transitive,
while rel is symmetric;

— A Concept’s scopeNote is an rdfs:subPropertyOf the property
rdfs:comment;

— All properties describing a resource’s name (e.g. descriptorName) are
declared an rdfs:subPropertyOf the property rdfs:label;

— Each of these name properties is also an owl:InverseFunctionalProperty,
as the names are unique in the XML file. Note that this may not hold for
future versions of MeSH;

— All properties recording a date are an owl:FunctionalProperty;

— The XML DTD defines that some elements occur either zero or once in the
data. The corresponding RDF properties can also be declared functional;

— As a Term belongs to exactly one Concept, and a Concept to exactly one
Descriptor, [Concept] hasTerm [Term] as well as [Descriptor] hasConcept
[Concept] is an owl:InverseFunctionalProperty.

Unfortunately, the relation represented by class EntryCombination cannot be
supplied with additional semantics, e.g. that it is an owl:SymmetricProperty
(see Guideline 9).
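
To illustrate what the first statement buys, the sketch below derives the triples that a reasoner could infer once brd is declared transitive and nrw its inverse. It is a naive fixpoint computation over tuples, not the behaviour of any particular OWL reasoner, and the concept identifiers in the usage note are hypothetical.

```python
def add_inverse(triples, prop, inverse):
    """Add (o, inverse, s) for every (s, prop, o), and vice versa."""
    extra = {(o, inverse, s) for s, p, o in triples if p == prop}
    extra |= {(o, prop, s) for s, p, o in triples if p == inverse}
    return set(triples) | extra

def transitive_closure(triples, prop):
    """Repeatedly add (a, prop, c) whenever (a, prop, b) and (b, prop, c) hold."""
    closed = set(triples)
    changed = True
    while changed:
        changed = False
        edges = [(s, o) for s, p, o in closed if p == prop]
        for s, mid in edges:
            for mid2, o in edges:
                if mid == mid2 and (s, prop, o) not in closed:
                    closed.add((s, prop, o))
                    changed = True
    return closed
```

Starting from ("c1", "brd", "c2") and ("c2", "brd", "c3"), the closure adds ("c1", "brd", "c3"), and add_inverse then supplies the three corresponding nrw triples. Passing the same property twice, as in add_inverse(triples, "rel", "rel"), gives the symmetric reading of rel.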

Step 2b: interpretations. In Step 2b, the following interpretations are made, 
following Guideline 8. Note that these are examples, as we have no specific 
application in mind. 

— brd is an rdfs:subPropertyOf rdfs:subClassOf;

— Descriptor and Concept are declared rdfs:subClassOf rdfs:Class.

Step 3: standardization. In Step 3, a mapping is created between the MeSH
schema and the SKOS schema. The following constructs can be mapped (using
rdfs:subPropertyOf and rdfs:subClassOf):

— mesh:subTreeOf onto skos:broader;

— mesh:Descriptor onto skos:Concept;

— mesh:hasRelatedDescriptor onto skos:related;

— mesh:descriptorName onto skos:prefLabel.

There is considerable mismatch between the schemas. Descriptors are the central
concepts between which hierarchical relations exist, but it is unclear how
MeSH Concepts and Terms can be dealt with. SKOS defines datatype properties
with which terms can be recorded as labels of Concepts, but this cannot
be mapped meaningfully onto MeSH’s Concept and Term classes. For example,
mesh:conceptName cannot be mapped onto skos:prefLabel, as the former’s domain
is mesh:Concept, while the latter’s domain is skos:Concept (skos:Concept
is already mapped onto mesh:Descriptor). Furthermore, mesh:Category cannot
be mapped onto skos:TopCategory, because TopCategory is a subclass of
skos:Concept, while mesh:Category is not a subclass of mesh:Descriptor.
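
The effect of the four mappings can be sketched as a simple rewriting of triples: every statement that uses a mapped MeSH construct also holds for the SKOS construct it is a subproperty or subclass of. The dictionaries below restate the mappings listed above; the function name skos_view and the tuple representation are assumptions for illustration.

```python
PROPERTY_MAPPING = {
    "mesh:subTreeOf": "skos:broader",
    "mesh:hasRelatedDescriptor": "skos:related",
    "mesh:descriptorName": "skos:prefLabel",
}
CLASS_MAPPING = {"mesh:Descriptor": "skos:Concept"}

def skos_view(triples):
    """Return the SKOS statements entailed by the mapped MeSH statements.
    Statements about unmapped constructs yield nothing."""
    view = set()
    for s, p, o in triples:
        if p in PROPERTY_MAPPING:
            view.add((s, PROPERTY_MAPPING[p], o))
        elif p == "rdf:type" and o in CLASS_MAPPING:
            view.add((s, "rdf:type", CLASS_MAPPING[o]))
    return view
```

A triple using mesh:conceptName produces no SKOS statement in this sketch, which is the mismatch described above: the Concept and Term levels of MeSH are invisible in the SKOS view.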



5 Case Two: WordNet 

This section describes how the method has been applied to WordNet release 
2.0. The original source consists of 18 Prolog files (23 MB in total). The con- 
version programs (written in Prolog) plus links to the original source as well as 
the output files of each step can be found at http://thesauri.cs.vu.nl/. The 
conversion took two persons approximately three weeks to complete. Note that 
Step 3 for WordNet is not discussed here for reasons of space, but is available at 
the aforementioned website.

5.1 Analysis of WordNet 

WordNet [2] is a concept-based thesaurus for the English language. The concepts 
are called “synsets” which have their own identifier. Each synset is associated 
with a set of lexical representations, i.e. its set of synonyms. The synset concept 
is divided into four categories, i.e. nouns, verbs, adverbs and adjectives. Most 
WordNet relations are defined between synsets. Example relations are hyponymy 
and meronymy. 

There have been a number of translations of WordNet to RDF and OWL
formats. Dan Brickley⁸ translated the noun/hyponym hierarchy directly into
RDFS classes and subclasses. This is different from the method we propose,
because it does not preserve the original source structure. Decker and Melnik⁹
have created a partial RDF representation, which does preserve the original
structure. The conversion of the KID Group at the University of Neuchâtel¹⁰
constitutes an extension of representation both in scope and in description of
semantics (by adding OWL axioms). We mainly follow this latter conversion and
relate it to the steps in our conversion method. In the process we changed and
extended the WordNet schema slightly (and thus also the resulting conversion).

⁸ http://lists.w3.org/Archives/Public/www-rdf-interest/1999Dec/0002.html

⁹ http://www.semanticweb.org/library/

5.2 Converting WordNet 

Step 1a: structure-preserving translation. In this step the baseline classes
and properties are created to map the source representation as precisely as pos- 
sible to an RDF representation: 

— Classes: SynSet, Noun, Verb, Adverb, Adjective (subclasses of SynSet), 
AdjectiveSatellite (subclass of Adjective); 

— Properties: wordForm, glossaryEntry, hyponymOf, entails, similarTo,
memberMeronymOf, substanceMeronymOf, partMeronymOf, derivation,
causedBy, verbGroup, attribute, antonymOf, seeAlso, participleOf,
pertainsTo.

Note that the original WordNet naming is not very informative (e.g. “s”
represents synset). For readability, here we use the rdfs:labels that have been added
in the RDF version. All properties except for the last four have a synset as their 
domain. The range of these properties is also a synset, except for wordForm and 
glossaryEntry. Some properties have a subclass of SynSet as their domain 
and/or range, e.g. entails holds between Verbs. 

The main decision that needs to be taken in this step concerns the following 
two interrelated representational issues: 

1. Each synset is associated with a set of synonym “words”. For example, 
the synset 100002560 has two associated synonyms, namely nonentity and 
nothing. Decker and Melnik represent these labels by defining the (multi- 
valued) property wordForm with a literal value as its range (i.e. as an OWL 
datatype property). The Neuchâtel approach is to define a word as a class
in its own right (WordObject). The main disadvantage of this is that one 
needs to introduce an identifier for each WordObject as it does not exist in 
the source representation, and words are not unique (homonymy). 

2. The last four properties in the list above (antonymOf, etc.) do not represent 
relations between synsets but instead between particular words in a synset. 
This also provides the rationale for the introduction of the class WordObject 
in the Neuchâtel representation: antonymOf can now simply be defined as a
property between WordObjects.

We prefer to represent words as literal values, thus avoiding the identifier
problem (see Guideline 11). For handling properties like antonymOf we defined a
helper class SynSetWord with properties linking it to a synset and a word. For 
each subclass of SynSet, an equivalent subclass of SynSetWord is introduced (e.g. 
SynSetVerb). A sample representation of an antonym looks like this: 

    <SynSetWord>
      <inSynSet rdf:resource="#100017087"/>
      <relevantWord>natural_object</relevantWord>
      <antonymOf>
        <SynSetWord>
          <inSynSet rdf:resource="#100019244"/>
          <relevantWord>artifact</relevantWord>
        </SynSetWord>
      </antonymOf>
    </SynSetWord>

In this example, the word natural_object in synset 100017087 is an antonym
of the word artifact in synset 100019244.

¹⁰ http://taurus.unine.ch/GroupHome/knowler/wordnet.html
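
The same antonym structure can be written down as triples. The Python sketch below is one way to generate them; the blank node labels are hypothetical, and the function names are introduced here for illustration rather than taken from the conversion programs.

```python
def synset_word(node, synset_id, word):
    """Describe one SynSetWord helper node: a word as used in one synset.
    The word itself stays a literal, so no new word identifier is needed."""
    return [
        (node, "rdf:type", "SynSetWord"),
        (node, "inSynSet", "#" + synset_id),
        (node, "relevantWord", word),
    ]

def antonym(node_a, synset_a, word_a, node_b, synset_b, word_b):
    """Relate two word-in-synset nodes with antonymOf."""
    triples = synset_word(node_a, synset_a, word_a)
    triples += synset_word(node_b, synset_b, word_b)
    triples.append((node_a, "antonymOf", node_b))
    return triples
```

For the sample above, antonym("_:w1", "100017087", "natural_object", "_:w2", "100019244", "artifact") gives seven triples: three per helper node plus the antonymOf arc between them.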

Step 1b: explication of syntax. The source representation of WordNet does
not contain many implicit elements. The only things that need to be added
here are the notions of hypernymy and holonymy (three variants). Both are
only mentioned in the text and are apparently the inverse¹¹ of, respectively, the
hyponym relation and the three meronym variants. Consequently, these four
properties were added to the schema.

Step 2a: explication of semantics. In this step additional OWL axioms can 
be introduced to explicate the intended semantics of the WordNet classes and 
properties. A selection: 

— Noun, Verb, Adverb, and Adjective together form disjoint and complete 
subclasses of SynSet; 

— hyponymOf and hypernymOf are transitive properties; 

— hyponymOf only holds between nouns or verbs;¹²

— hyponymOf/hypernymOf and the three variants of meronymOf/holonymOf are
inverse properties;

— verbGroup and antonymOf are symmetric properties.

In addition, we defined the properties wordForm and glossaryEntry as
subproperties of rdfs:label and rdfs:comment, respectively.

From the WordNet documentation it is clear that these properties have this 
type of intended semantics. The alternative for defining these as subproperties 
would have been to use rdfs:label and rdfs:comment directly in the RDF
representation, thus dropping the original names. This makes the traceability of 
the conversion less clear. 

¹¹ The WordNet documentation uses the term “reflexive”, but it is clear that
inverseness is meant.

¹² In the Neuchâtel representation an intermediate class NounsAndVerbs is introduced
to express these constraints. This is not needed, as OWL supports local property
restrictions which allow one, for example, to state that the value range of the hyponymOf
property for Noun must be Noun.

Step 2b: interpretations. We have used WordNet heavily in semantic
annotation of images (see e.g. [5]). In that application we used the WordNet hierarchy
as an RDFS subclass tree by adding the following two metastatements:

— wn:SynSet rdfs:subClassOf rdfs:Class;

— wn:hyponymOf rdfs:subPropertyOf rdfs:subClassOf.

Tools such as Triple20¹³ will now be able to visualize the synset tree as a
subclass tree. The repercussions of this type of metamodeling for RDF storage and
retrieval are discussed in [14]. 
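
The practical consequence of the two metastatements can be sketched as follows: once hyponymOf is read as rdfs:subClassOf, an image annotated with one synset is also an instance of every synset above it. The synset names below are hypothetical placeholders, not WordNet identifiers, and the traversal is a plain graph walk rather than the behaviour of a specific RDFS engine.

```python
def superclasses(cls, hyponym_of):
    """All classes reachable from cls when hyponymOf is read as rdfs:subClassOf.
    hyponym_of maps a synset to the list of synsets it is a hyponym of."""
    seen = []
    stack = [cls]
    while stack:
        current = stack.pop()
        for parent in hyponym_of.get(current, ()):
            if parent not in seen:
                seen.append(parent)
                stack.append(parent)
    return seen

def inferred_types(annotation_class, hyponym_of):
    """Types of a resource annotated with annotation_class under this reading."""
    return [annotation_class] + superclasses(annotation_class, hyponym_of)
```

With hyponym_of = {"wn:poodle": ["wn:dog"], "wn:dog": ["wn:animal"]}, a resource annotated with wn:poodle is inferred to be a wn:dog and a wn:animal as well, so a query for images of animals would retrieve it.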

6 Related Research 

Soualmia et al. [13] describe a migration of a specialized, French version of MeSH 
to an OWL DL representation. Their goal is to improve search in a database 
of medical resources. We mention a few of their modeling principles. Firstly, 
they “clean” the taxonomy by distinguishing between part-of and is-a
relationships (the former type are translated into a partOf property, the latter into
rdfs:subClassOf). Secondly, qualifiers are translated into properties, and their
domains are restricted to the union of the descriptors on which they may be
applied. The properties are hierarchically organized using rdfs:subPropertyOf
according to the qualifier hierarchy.

Wroe et al. [16] describe a methodology to migrate the Gene Ontology (GO) 
from XML to DAML+OIL. Their goal is to “support validation, extension and 
multiple classification” of the GO. In each step, the converted ontology is en- 
riched further. For example, three new part_of relations are introduced. Also, 
new classes are added to group part_of instances under their parent compo- 
nents, which enables visualization of the hierarchy. MeSH and the KEGG en- 
zyme database are used to enrich class definitions so a new classification for 
Gene enzyme functions can be made by a reasoner. Additional modeling of class 
restrictions allowed the same reasoner to infer 17 is-a relationships that were 
omitted in the original source. 

Golbeck et al. [4] describe a conversion of the NCI (National Cancer
Institute) Thesaurus from its native XML format to OWL Lite. Their goal is to
“make the knowledge in the Thesaurus more useful and accessible to the
public”. A mapping of XML tags onto OWL classes and properties is defined, based
on an analysis of the underlying conceptual model and NCI’s thesaurus development
process. rdf:IDs are created using a transformation of the original concept
names (spaces removed, illegal characters substituted). This is a reasonable
approach, under the assumption that names are indeed and will remain unique (see
Guideline 11).

There are two main differences with our work. Firstly, the aforementioned
projects do not distinguish between “as-is” conversion and enrichment steps, as
our method does. Therefore, the conversions may only be usable for the project’s
own goals. Secondly, we try to generalize over specific conversions and aim to
define a more general conversion process.

¹³ http://www.swi-prolog.org/packages/Triple20/

In the SWAD-Europe project a schema is being developed to encode RDF
thesauri. Additionally, work is in progress to produce a guideline document on
converting multilingual thesauri [8]. The development of a standard schema
influences our standardization step, and guidelines on converting multilingual
thesauri may be incorporated to broaden the scope of our method.

7 Discussion 

This paper has presented a method to convert existing thesauri to RDF(S) and 
OWL, in a manner that is sanctioned by the original documentation and format. 
Only in a separate step may interpretations be made for application-specific 
purposes. 

Two additional aims of converting existing resources to the RDF model may 
be identified. Firstly, the quality of a thesaurus may be improved using the 
semantics of RDF(S), as some thesauri use relations with weak or unclear se- 
mantics, or apply them in an ambiguous way [12]. Secondly, converted thesauri 
can be checked using standard reasoners, identifying e.g. missing subsumption 
and inverse relations (e.g. BT/NT). 

Recently the W3C has installed the Semantic Web Best Practices and
Deployment (SWBPD) Working Group¹⁴, which aims to provide guidelines for
application developers on how to deploy Semantic Web technology. This method 
may serve as input for and might be extended by this Working Group. 

Several issues remain as future research in developing a methodology. Firstly, 
translating between the source model and the RDF model is a complex task with 
many alternative mappings, especially for thesauri that do not conform to or 
define extensions of the ISO and ANSI standards. Secondly, more guidelines are 
required on how to convert multilingual thesauri. Thirdly, a standard thesaurus 
schema is required to perform step three of our method. It is clear that the 
current SWAD-E proposal does not completely cover complex thesauri such as 
MeSH. An open question is whether the proposal might be extended or whether this
type of thesaurus is simply outside its scope, as MeSH deviates from the ISO
and ANSI standards.


Abstract. A central theme of the Semantic Web is that programs
should be able to easily aggregate data from different sources. Unfor- 
tunately, even if two sites provide their data using the same data model 
and vocabulary, subtle differences in their use of terms and in the as- 
sumptions they make pose challenges for aggregation. Experiences with 
the TAP project reveal some of the phenomena that pose obstacles to a 
simplistic model of aggregation. Similar experiences have been reported 
by AI projects such as Cyc, which has led to the development and use 
of various context mechanisms. In this paper we report on some of the 
problems with aggregating independently published data and propose a 
context mechanism to handle some of these problems. We briefly survey 
the context mechanisms developed in AI and contrast them with the re- 
quirements of a context mechanism for the Semantic Web. Finally, we 
present a context mechanism for the Semantic Web that is adequate to 
handle the aggregation tasks, yet simple from both computational and 
model theoretic perspectives. 



1 Introduction 

The ease with which web sites could link to each other doubtless contributed to 
the rapid adoption of the World Wide Web. It is hoped that as the Semantic Web 
becomes more prevalent, programs will be able to similarly weave together data 
from diverse sites. Indeed, the data model behind RDF was significantly moti- 
vated by the fact that directed labeled graphs provided a simple, but effective 
model for aggregating data from different sources. 

Unfortunately, while the languages of the Semantic Web (RDF, RDFS, OWL, 
etc.) provide a method for aggregation at the data model level, higher level
differences between data sources sometimes make it inappropriate to directly merge
data from them. Just as with the human readable web, Semantic Web publish- 
ers make assumptions and use the same term in subtly different ways. On the 
human readable web, the human consumers of web pages are able to use their 
common sense to reconcile these differences. On the Semantic Web, we need to 
develop the mechanisms that allow us to explicitly represent and reason with 
these assumptions and differences. This will enable the programs that consume 
data from the Semantic Web to reconcile these differences, or at least avoid the 
problems that arise by applying an overly simple aggregation model to such data 
sources.

In the past, AI researchers have encountered similar issues when aggregating
structured knowledge from different people or even the same person at different 
times. To handle these issues, mechanisms such as contexts and micro-theories 
have been proposed and implemented in projects such as Cyc [17]. While the 
kind of phenomena encountered in those projects are substantially more intri- 
cate and unlikely to be encountered in the near future on the Semantic Web, the 
scale and federated nature of the Semantic Web pose a new set of challenges. 
We believe that a context mechanism that is similar in spirit to the earlier con- 
text mechanisms will be not only useful, but required to achieve the Semantic 
Web vision. However, the differences between AI systems and the Semantic Web 
also mean that a context mechanism for the Semantic Web will have substantial 
differences from the AI context mechanisms. In this paper, we discuss the mo- 
tivation for some of the basic requirements and present some possibilities for a 
context mechanism for the Semantic Web. 

We begin by recounting some of the problems in aggregating data from dif- 
ferent sources that were encountered on the TAP [20] project. These examples 
provide the motivation for the capabilities required of a context mechanism for 
the Semantic Web. We then present a simple context mechanism for handling 
these problems. After that, we discuss the model theory extensions required 
to incorporate this mechanism into RDF. Finally, we discuss related work and 
Semantic Web issues and constructs related to contexts. 

2 Overview of TAP 

The TAP project [20], [18] has over the last three years attempted to create 
platforms for publishing to and consuming data from the Semantic Web. Using 
this platform, we have built a number of applications [19] both to validate our 
assumptions and to help bootstrap the Semantic Web. 

Building anything more than the simplest toy applications requires a sub- 
stantial amount of data. Unfortunately, the Semantic Web, in its current stage, 
has little to offer in this regard. On the other hand, we do have a very large 
body of unstructured knowledge in the human readable World Wide Web. So, 
both to solve our problem and to bootstrap the Semantic Web, we have cre- 
ated a large scale knowledge extraction and aggregation system called onTAP. 
onTAP includes 207 HTML page templates which are able to read and extract 
knowledge from 38 different high quality web sites. The HTML template system 
has currently read over 150,000 pages, discovering over 1.6 million entities and 
asserting over 6 million triples about these entities. The system that aggregates 
this data can run in either a dynamic mode, in response to a query from an 
application, or in a batch mode. In batch mode, this aggregator is used to create 
a classification index for keyword-based queries and for scanning of documents 
for referenced entities. 

onTAP also includes a system to read political news articles from Yahoo! 
News. The articles in this news feed are matched against natural language
patterns to extract entities, attributes of and relationships between these entities.
Work is currently underway to expand this system to read more news sources.
This system has read 46,113 articles to discover 6,052 unique entities and 137,082 
triples about them. 

Most of the content of TAP is obtained via these templates that draw on 
HTML pages that are targeted at humans. However, we believe that obser- 
vations regarding the contextual phenomena associated with this data can be 
extrapolated to the case where the data is directly published by the sites in 
a structured form. Most of the sites that TAP uses do have their content in 
a structured form in a database. Front end processors query the database and 
format the data into HTML. When a site such as Amazon makes its data avail- 
able in a structured form, they publish the same data. That is, the format and 
markup language used are different, but the assumptions, approximations and 
other contextual phenomena that are in their HTML pages are also present in 
the structured data that they publish. 

3 Contextual Phenomena 

In this section we discuss some of the contextual phenomena that we observed 
in the process of building the TAP knowledge base that cause problems in data 
aggregation. The examples of contextual phenomena we observed can be classi- 
fied into a small number of varieties of contexts. Here we describe each of these 
varieties along with some examples that we encountered. We also discuss how 
we would like to handle each of these cases during aggregation. 

Class Differences. Different sites often use a particular class in slightly differ- 
ent ways. Sites may differ on the meaning of seemingly unambiguous concepts 
such as Person. For example, the site Who2 classifies C3PO (the robot in
Star Wars) as a person, whereas most other sites classify it as a robot. During
aggregation, we should map Who2’s use of Person to something more
general. Alternatively, we can believe what Who2 has to say about some resource,
unless what it says contradicts what some other site says.

Propositional Attitude. A related phenomenon is that of a site having an 
implicit propositional attitude. For example, many sites providing reviews of 
television shows specify that Josiah Bartlet (the character who holds the office of
President of the US in the television series ‘West Wing’) is the President
of the United States. During aggregation, the propositional attitude of these 
statements should be made explicit. 

Property Type Differences. A common source of differences between sites 
is that property types such as capacity and price are used differently. An 
example is the capacity of nuclear power plants. These plants have two dif- 
ferent kinds of capacities: a design capacity and an actual capacity. Some 
sites specify the design capacity while others specify the actual capacity, 
but in most cases they just refer to it as the capacity. When aggregating, 
we need to either create a generalization of the two capacities or determine 
which capacity a particular site is referring to and map the capacity on that 
site to the appropriate version of capacity in the aggregate.

Point of View. More substantial differences occur when there are conflicting
points of view. Is Taiwan a country of its own, or a province of China? The 
answer depends very strongly on which site is queried. Similarly, different 
sites classify Hamas as a terrorist organization, a freedom fighting organiza- 
tion and a humanitarian organization. This kind of subjective data is often 
mixed in with more objective data (like the head of Hamas or the capital 
of Taiwan). When aggregating data from these sites, we would like to either
make the subjectivity explicit (through the use of propositional attitudes) 
or only selectively import those facts that are not contentious. 

Implicit Time. Sites often publish a piece of data that is true at the time 
of publication, with the temporal qualification being left implicit. Equally 
often, this data does not get updated when it no longer holds. There are a 
number of sites that list Bill Clinton as the President of the US, that refer 
to Yugoslavia as a country, etc. Unfortunately, such implicitly temporally 
qualified data is often mixed in with data that is not temporally qualified. 
For example, it is still true that Bill Clinton graduated from Yale and the 
latitude and longitude of Sarajevo have not changed. When aggregating data 
from these sites, we would like to either make the time explicit or only 
selectively import those facts that are not likely to have changed. 

A similar phenomenon is that of a site leaving the unit of measure associated 
with an attribute value implicit. So, instead of specifying the price as US$40, 
they might simply say 40. In such cases, we need to either make the unit of 
measure explicit or perform the appropriate conversion. 

Approximations. Approximations are another source of differences between 
sites. For example, the CIA World Factbook provides approximate values 
for the population, area, etc. of all countries. More accurate numbers are 
typically available from the governments of each of these countries, only 
some of which are online. We would like to be able to combine data from the CIA
World Factbook and data from the governments, preferring the government 
data when it is available. 
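
Two of the aggregation policies described above are simple enough to sketch in Python: preferring one source over another for the same attribute (Approximations), and making an implicit unit of measure explicit on import. The data layout, the function names, and all values in the usage note are assumptions for illustration; this is not how TAP implements aggregation.

```python
def aggregate(preferred, fallback):
    """Merge two {(entity, property): value} maps.
    A value from `preferred` overrides the value from `fallback`;
    the fallback fills in whatever the preferred source does not cover."""
    merged = dict(fallback)
    merged.update(preferred)
    return merged

def with_unit(value, default_unit):
    """Make an implicit unit explicit. Values that already carry a unit,
    given as a (number, unit) pair, are passed through unchanged."""
    if isinstance(value, tuple):
        return value
    return (value, default_unit)
```

If an approximate source and a government source both report a population for the same (made-up) country, aggregate(government, factbook) keeps the government figure while retaining the approximate figures for countries the government source does not cover. Likewise, with_unit(40, "USD") turns a bare 40 from a site known to quote US dollars into (40, "USD").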

We recognize that these differences could arise because the TAP data was
obtained by extracting structured data from HTML pages that were intended 
for human consumption. However, these phenomena are not an artifact of the 
information being published in an unstructured format. Most of the information 
on these sites is drawn from structured databases and these phenomena 
manifest themselves in these databases as well. Consequently, we believe that 
these problems will persist even when the data is made available in a machine 
readable form. 

These kinds of differences between sites pose problems when data from these 
sites is aggregated. The problem is not that some of these sites are not trust- 
worthy or that all of their data is bad. In fact, sources of data such as the CIA 
Factbook and Who2 are rich and useful repositories that should be part of the
Semantic Web. What we need is a mechanism to factor the kinds of differences 
listed above as part of the data aggregation process. Various formalizations of 
context mechanisms have been proposed in the AI literature to handle this
process of factoring differences in representations between knowledge bases. In the
next section we first briefly review some of the context-related concepts from 
AI, note the differences between those and our requirements for the Semantic 
Web and then, in the following section, propose a simplified version to solve our 
problem for the Semantic Web. 

4 Contexts in AI 

Contexts as first class objects in knowledge representation systems have been 
the subject of much study in AI. KR systems as early as KRL [2] incorporated a 
notion of contexts. The first steps towards introducing contexts as formal objects 
were taken by McCarthy ([14], [15]) and Guha [10] in the early 1990s. This was 
followed by a number of alternate formulations and improvements by Buvac [6] 
[4], Fikes [5], Giunchiglia [8], Nayak [16] and others. Contexts/Microtheories 
are an important part of many current KR systems such as Cyc [17]. Contexts 
remain an active area of research in AI and Philosophy. 

Contexts have been used in AI to handle a wide range of phenomena, a cate- 
gorization of which can be found in [9]. They have been used in natural language 
understanding to model indexicals and other issues that arise at the semantic 
and pragmatic processing layers. They have found extensive use in common- 
sense reasoning systems where they are used to circumscribe the scope of naive 
theories. These systems use nested contexts, with the system being able to tran- 
scend the outermost context to create a new outer context. Both common sense 
and natural language systems also have a class of contexts that are ephemeral, 
that might correspond to a particular utterance or to a particular problem solv- 
ing task. The ability to easily introduce a new context and infer attributes of 
that context adds substantial complexity to these context mechanisms. Contexts 
have also been used in model based reasoning [16] to partition models at different 
levels of abstraction. 

The scope and complexity of the AI problems that contexts have been em- 
ployed for are substantially greater than anything we expect to encounter on the 
Semantic Web. The primary role for contexts on the Semantic Web is to factor 
the differences (like the ones described earlier) between data sources when aggre- 
gating data from them. Consequently, we do not need nested contexts, ephemeral 
contexts and the ability to transcend contexts. 

On the other hand, the expected scale and highly distributed nature of the 
Semantic Web is in contrast to AI systems, most of which are much smaller and 
centralized. So, while we don’t need the level of functionality provided by the AI 
formulations of contexts, we do place stronger constraints on the computational 
complexity and ease of use of the context mechanism. 

In the next section, we develop a context mechanism for the Semantic Web. 
In the following section, we discuss the model theory extensions required for this 
mechanism. 
5 Contexts for the SW 

We present our context mechanism in the following setting. We have Semantic 
Web content (in RDFS or one of the OWL dialects) from a number of different 
URLs. The content from each URL is assumed to be uniform, but there may be 
differences like the ones described earlier between the content from the different 
URLs. We would like to create an internally consistent aggregate of the data 
from these different sites. The aggregate should be maximal in the sense that it 
should incorporate as much of the data from the different URLs as possible. 

Each data source (or collection of data sources) is abstracted into a context. 
Contexts are first class resources that are instances of the class Context³. We 
define a PropertyType (contextURL) whose domain is Context that specifies the 
location(s) of the data source(s) corresponding to the context. The contents of 
the data source(s) are said to be true in that context. For the sake of keeping the 
description simple, for the remainder of this paper, unless otherwise specified, 
we will assume that the context has a single data source and that the URL of the 
context is that of the data source. So, for example, if a chunk of RDF is available 
at the URL tap.stanford.edu/People.rdf , we can have a context corresponding to 
this URL and the contents of this URL are said to be true in that context. Since 
the context is a resource, like any other resource on the Semantic Web, other 
chunks of RDFS/OWL can refer to it. 
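
As a minimal sketch of this setting, a context can be modelled as a resource that carries its contextURL and the set of triples true in it. The class name, the `holds` method, the `ctx:` and `ex:` identifiers and the sample triple below are illustrative assumptions, not part of the proposal itself.

```python
from dataclasses import dataclass, field
from typing import Set, Tuple

Triple = Tuple[str, str, str]  # (subject, property, object)


@dataclass
class Context:
    """A context resource; its triples are the statements true in it."""
    uri: str          # hypothetical identifier for the context resource
    context_url: str  # value of contextURL: location of the data source
    triples: Set[Triple] = field(default_factory=set)

    def holds(self, triple: Triple) -> bool:
        """Return True if the triple is true in this context."""
        return triple in self.triples


# Hypothetical data standing in for the parsed contents of People.rdf.
people = Context(uri="ctx:TapPeople",
                 context_url="tap.stanford.edu/People.rdf")
people.triples.add(("ex:SomePerson", "type", "Person"))

# Other data can refer to the context like any other resource.
about_context: Triple = (people.uri, "type", "Context")
```

Because the context is identified by a URI, statements such as `about_context` can themselves live in another data source.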

We are interested in defining contexts that aggregate data from other con- 
texts. In keeping with the spirit of the Semantic Web, we would like to do this by 
declaratively specifying these new Aggregate Contexts. The different mechanisms 
that may be used in specifying these aggregate contexts define our design space. 
We start by examining the very general mechanism used in [10], [15] which has 
been followed by others in the AI community. We then present a much simpler, 
though less expressive variant that might be adequate for the semantic web. 

Guha [10] and McCarthy [15] introduced a special symbol ist and the nota- 
tion $ist(c_i, \varphi)$ to state that a proposition $\varphi$ is true in the context $c_i$. Further, 
these statements can be nested, so that statements like $ist(c_i, ist(c_j, \varphi))$ can be 
used to contextualize the interpretation of contexts themselves. The system is 
always in some context. The system can enter and exit contexts. At any point in 
time, there is an outermost context, which can be transcended by creating a new 
outer context. A symbol can denote different objects in different contexts and 
the domains associated with different contexts can be different. Since contexts 
are first class objects in the domain, one can quantify over them, have functions 
whose range is a context, etc. All this allows one to write very expressive formu- 
lae that lift axioms from one context to another. While this is very convenient, it 
also makes it quite difficult to provide an adequate model theory and extremely 
difficult to compute with. 

Nayak [16], Buvac [3] and others have tried to simplify this general formu- 
lation by introducing restrictions. Nayak considers the case where no nesting is 

3 We will drop namespace qualifiers, etc. for the sake of readability. Terms in this font 
refer to RDF resources. 
allowed, contexts may only appear as the first argument to ist and are not first 
class objects (i.e., cannot be quantified over, etc.). Under these constraints, a 
classical modal logic like S5 can be used to provide a model theory for contexts. 
Buvac considers the case where a symbol is restricted to denote the same object 
in all contexts. For the purposes of the Semantic Web, Nayak’s restrictions seem 
rather severe, but Buvac’s may be acceptable. Both assume the Barcan formula: 

\[ \forall x\, \mathit{ist}(c, \varphi(x)) \leftrightarrow \mathit{ist}(c, \forall x\, \varphi(x)) \tag{1} \]

While these restrictions allow them to define a clean model theory, they are not 
enough to give us computational tractability. In fact, it is easy to show that 
none of these logics is even semi-decidable. 

Giunchiglia [8] and other researchers at Trento have used the notion of a 
bridge rule to formalize contexts. Much of their work is in the propositional 
realm and hence not directly relevant to the Semantic Web. 

A general theme behind all these approaches is the introduction of the single 
interpreted symbol ist. ist is the only new symbol for which the underlying 
logic provides an interpretation. This is in contrast with RDF, RDFS, etc. in 
which a number of symbols (e.g., Class, PropertyType, subClassOf, ...) are all 
interpreted by the logic. We now extend this approach to handle contexts. 

An aggregate context is a context whose contents (i.e., the statements that 
are true in that context) are lifted from other contexts. That is, they are im- 
ported into the aggregate context from other contexts after appropriate normal- 
ization/factoring. We now introduce a number of vocabulary elements that can 
be used to specify this lifting. This list is not exhaustive, but is adequate to cover 
the most common types of contextual differences that we discussed in section 3. 
We give informal descriptions and axiomatic definitions of these terms here and, 
in the next section, outline the approach for defining the model theory for these 
constructs. We will use ist as a predicate in the meta-theory for the axiomatic 
definitions, even though it is not part of the base language. 

— AggregateContext: This subclass of contexts corresponds to aggregate 
contexts. Since the Semantic Web allows anyone to make statements about 
any resource, there can be complications when different sites provide differ- 
ent definitions for a particular aggregate context. More specifically, allowing 
other contexts to specify what should be imported into a context, while safe 
in simple languages like RDF/S, opens the doors to paradoxes in more ex- 
pressive languages like OWL. Even with RDF/S, it is important that the lift- 
ing process be simple to execute. To achieve this, we constrain the URL of an 
aggregate context to contain the full specification of what it imports. In other 
words, a lifting rule for importing into a particular context is true only in 
that context. We will later consider a semantic constraint for enforcing this. 

— importsFrom: This is a property type whose domain is AggregateContext 
and whose range is Context. If $c_1$ importsFrom $c_2$, then everything that is 
true in $c_2$ is also true in $c_1$. The defining axiom for importsFrom is as follows. 



\[ \mathit{ist}(c_2, p) \land \mathit{ist}(c_1, \mathit{importsFrom}(c_1, c_2)) \rightarrow \mathit{ist}(c_1, p) \tag{2} \]
We do not allow cycles in the graph defined by importsFrom. importsFrom 
is the simplest form of lifting and corresponds to the inclusion of one context 
into another. The more sophisticated forms of lifting require us to create a 
resource for the lifting rule. 
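
The importsFrom axiom (2), together with the acyclicity requirement, can be sketched as a recursive union over the import graph. This is only an illustration: the function name, the dictionary-based representation and the `ctx:` identifiers are assumptions. The `imports_from` entry for a context stands for the importsFrom statements asserted in that context itself, as the axiom requires.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, property, object)


def lift_imports(contents: Dict[str, Set[Triple]],
                 imports_from: Dict[str, List[str]],
                 target: str,
                 _path: FrozenSet[str] = frozenset()) -> Set[Triple]:
    """Return every triple true in `target` under the importsFrom axiom."""
    if target in _path:
        # Cycles in the importsFrom graph are not allowed.
        raise ValueError(f"cycle in importsFrom graph at {target}")
    lifted = set(contents.get(target, set()))
    for source in imports_from.get(target, []):
        # Everything true in the source is also true in the target.
        lifted |= lift_imports(contents, imports_from, source,
                               _path | {target})
    return lifted


contents = {
    "ctx:Factbook": {("France", "capitalCity", "Paris")},
    "ctx:Aggregate": {("France", "type", "Country")},
}
imports_from = {"ctx:Aggregate": ["ctx:Factbook"]}
aggregate = lift_imports(contents, imports_from, "ctx:Aggregate")
```

Here `aggregate` contains both the triple stated directly in the aggregate context and the one lifted from the imported context.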

LiftingRule: This class has all the lifting rules as its instances. Each 
LiftingRule must have a single AggregateContext as the value for 
targetContext and a single Context for sourceContext. We have subclasses 
of LiftingRule for the different kinds of lifting we would like to perform. 
An AggregateContext may have any number of lifting rules that lift into it. 
Lifting rules generally ignore some of the triples in the source context, 
import some of the triples without any modification, transform and then 
import some other set of triples and optionally add some new set of triples 
into the aggregate context. The specification of a lifting rule involves 
specifying which triples to import or which triples to add and how to 
perform the necessary transformations. Some lifting rules may also specify 
a preference for one source over another for some class of triples. Our goal 
here is not to specify an exhaustive set of transformations, but to cover 
some of the important ones and provide a flavor for the general approach. 
An important factor that impacts the representation of these LiftingRules 
is whether the aggregation process is restricted to be monotonic. If the 
process is allowed to be non-monotonic, the addition of a new LiftingRule 
to an aggregate context may cause certain triples to no longer hold. Non- 
monotonic lifting rules have the ability to say that everything not explicitly 
specified to be ignored or modified is to be imported. Consequently, they 
are easier to write, but do have the disadvantages of non-monotonicity. We 
describe the monotonic version and then suggest how it might be made 
more terse by introducing a non-monotonic construct. 

Selective Importing: These lifting rules explicitly specify the triples that 
should be directly imported from the source to the destination. Each triple 
may be specified by the property type or the first argument or the second 
argument. Optionally, these importing rules can specify the constraints on 
the first/second argument or combinations of property type and constraints 
on first/second argument etc. Examples: import capitalCity and area from 
the CIA Factbook. Import everything for instances of Book and AudioCD 
from Amazon. Import manufacturer for instances of ElectronicsProduct 
from Amazon. The defining axiom for Selective Importing Rules is as follows: 

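\[
\begin{aligned}
& \mathit{ist}(c_i,\ \mathit{targetContext}(lr, c_i) \land \mathit{sourceContext}(lr, c_j) \land \mathit{sourceFilter}(lr, sc) \;\land \\
& \qquad \mathit{targetFilter}(lr, tc) \land \mathit{propFilter}(lr, p) \;\land \\
& \qquad \mathit{type}(lr, \mathit{SelectiveImportLiftingRule})) \;\land \\
& \mathit{ist}(c_j,\ \mathit{type}(x, sc) \land \mathit{type}(y, tc) \land p(x, y)) \\
& \rightarrow\ \mathit{ist}(c_i,\ p(x, y))
\end{aligned}
\tag{3}
\]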
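
A rough sketch of axiom (3) over sets of triples follows. The function name, the use of a plain "type" property, and the sample `ex:` resources are assumptions made for illustration. Unlike the axiom, which names both filters, the sketch treats the class filters as optional, matching the informal description above.

```python
from typing import Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, property, object)


def selective_import(source: Set[Triple], prop: str,
                     source_class: Optional[str] = None,
                     target_class: Optional[str] = None) -> Set[Triple]:
    """Lift p(x, y) when x and y pass the class filters in the source."""
    def has_type(node: str, cls: Optional[str]) -> bool:
        # A missing filter is treated as "no constraint" (an assumption).
        return cls is None or (node, "type", cls) in source

    return {(s, p, o) for (s, p, o) in source
            if p == prop and has_type(s, source_class)
            and has_type(o, target_class)}


# Hypothetical source data; the resources are made up for illustration.
amazon = {
    ("ex:Player1", "type", "ElectronicsProduct"),
    ("ex:Player1", "manufacturer", "ex:MakerA"),
    ("ex:Novel1", "type", "Book"),
    ("ex:Novel1", "manufacturer", "ex:MakerB"),
}
lifted = selective_import(amazon, "manufacturer",
                          source_class="ElectronicsProduct")
```

Only the manufacturer triple for the ElectronicsProduct instance is lifted; the one attached to the Book instance is ignored.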

Preference Rules: In many cases, a number of sources have sparse data 
about a set of entities, i.e., each of them might be missing some of the 
attributes for some of the entities they refer to, and we might want to 
mitigate this sparsity by combining the data from the different sources. 
However, we might have a preference for one source over another, if the 
preferred source has the data. A preference rule can either specify a total 
preference ordering on a list of sources or simply that one particular source is 
preferred over another. As with Selective Importing lifting rules, a preference 
rule can be constrained to apply to only a particular category of triples. 
Example: Who2 has more detailed information about more celebrities than 
IMDB, but IMDB’s data is more accurate. This class of lifting rule allows 
us to combine Who2 with IMDB, preferring IMDB over Who2 if both have 
values for a particular property type for the same individual. 

A more sophisticated (but substantially more computationally complex) 
version of preference lifting rules is in terms of consistency, i.e., if an 
inconsistency is detected in the target context, then triples from the less 
preferred context are to be eliminated first (to try and restore consistency). 
Preference Lifting rules are non-monotonic across contexts in the sense that 
the addition of a new triple in one of the source contexts can cause another 
triple in the target context to be removed. However, they do not induce 
non-monotonicity within a context. The defining axiom for Preference 
Lifting Rules is as follows: 

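\[
\begin{aligned}
& \mathit{ist}(c_i,\ \mathit{targetContext}(lr, c_i) \land \mathit{sourceContext}(lr, c_j) \land \mathit{sourceContext}(lr, c_k) \;\land \\
& \qquad \mathit{propFilter}(lr, p) \land \mathit{sourceFilter}(lr, sc) \land \mathit{targetFilter}(lr, tc) \;\land \\
& \qquad \mathit{preferred}(lr, c_k) \land \mathit{type}(lr, \mathit{PreferenceLiftingRule})) \;\land \\
& \mathit{ist}(c_j,\ p(x, y) \land \mathit{type}(x, sc) \land \mathit{type}(y, tc)) \;\land \\
& \lnot\, \mathit{ist}(c_k,\ (\exists z)\,(p(x, z) \land \mathit{type}(x, sc) \land \mathit{type}(z, tc) \land (z \neq x))) \\
& \rightarrow\ \mathit{ist}(c_i,\ p(x, y))
\end{aligned}
\tag{4}
\]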
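
The core of a preference rule can be sketched as follows, under simplifying assumptions: the class filters and the inequality condition of axiom (4) are omitted, and triples from the preferred source are imported directly, which the axiom leaves to a separate rule. The function name and the sample data (including the birthYear values) are hypothetical.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]  # (subject, property, object)


def prefer(preferred: Set[Triple], other: Set[Triple],
           prop: str) -> Set[Triple]:
    """Combine two sources for `prop`, favouring `preferred` per subject."""
    # Subjects for which the preferred source has some value of prop.
    covered = {s for (s, p, _o) in preferred if p == prop}
    from_preferred = {t for t in preferred if t[1] == prop}
    # Lift from the other source only where the preferred one is silent.
    from_other = {(s, p, o) for (s, p, o) in other
                  if p == prop and s not in covered}
    return from_preferred | from_other


# Made-up values; they do not describe any real person.
imdb = {("ex:ActorA", "birthYear", "1950")}
who2 = {("ex:ActorA", "birthYear", "1951"),
        ("ex:ActorB", "birthYear", "1960")}
combined = prefer(preferred=imdb, other=who2, prop="birthYear")
```

Adding a birthYear triple for ex:ActorB to `imdb` would remove the Who2 value from `combined`, which is the cross-context non-monotonicity described above.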

— Mapping Constants: One of the most common transformations required 
is to distinguish between slightly different uses of the same term or to 
normalize the use of different terms for the same concept. These lifting rules 
specify the source term and the target term. As with selective importing 
lifting rules, we can constrain the application of these mappings to specific 
categories of triples. Example: Many different sites use the term price, some 
referring to price with tax, some to price with tax and shipping, etc. This 
class of lifting rules can be used to distinguish between them in the aggregate 
context. The earlier example of nuclear power plant capacity also falls into 
this category. The defining axiom of Term Mapping rules is as follows: 

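\[
\begin{aligned}
& \mathit{ist}(c_i,\ \mathit{targetContext}(lr, c_i) \land \mathit{sourceContext}(lr, c_j) \;\land \\
& \qquad \mathit{propMapTo}(lr, p_2) \land \mathit{sourceFilter}(lr, sc) \land \mathit{targetFilter}(lr, tc) \;\land \\
& \qquad \mathit{propMapFrom}(lr, p_1) \land \mathit{type}(lr, \mathit{TermMappingLiftingRule})) \;\land \\
& \mathit{ist}(c_j,\ p_1(x, y) \land \mathit{type}(x, sc) \land \mathit{type}(y, tc)) \\
& \rightarrow\ \mathit{ist}(c_i,\ p_2(x, y))
\end{aligned}
\tag{5}
\]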
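
Axiom (5) amounts to renaming a property while lifting. A minimal sketch, assuming a set-of-triples representation and an optional source class filter; the function name, the target term priceWithTax and the sample item are illustrative assumptions.

```python
from typing import Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, property, object)


def map_term(source: Set[Triple], prop_from: str, prop_to: str,
             source_class: Optional[str] = None) -> Set[Triple]:
    """Lift prop_from(x, y) from the source as prop_to(x, y)."""
    def in_scope(node: str) -> bool:
        # A missing class filter means every subject is in scope.
        return (source_class is None
                or (node, "type", source_class) in source)

    return {(s, prop_to, o) for (s, p, o) in source
            if p == prop_from and in_scope(s)}


# Hypothetical site whose "price" is known to include tax.
site_a = {("ex:Item1", "type", "Book"), ("ex:Item1", "price", "40")}
lifted = map_term(site_a, "price", "priceWithTax", source_class="Book")
```

A second site whose price excludes tax would get its own rule with a different target term, so the two readings stay distinct in the aggregate context.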
— Mapping more complex graph patterns: All the above lifting rules 
deal with cases where the target graph is isomorphic to (portions of) the 
source graph. Sometimes, this constraint cannot be satisfied. For example, 
if the source leaves time implicit and the target has an explicit model of 
time, the source and target graphs are not likely to be isomorphic. 
Assuming we don’t use an explicit ist in the base language, we can 
introduce a special construct for each phenomenon, such as making implicit 
time explicit. With this approach, we would introduce a property type 
(contextTemporalModel) to specify the model of time used by a context 
(implicit, Situation Calculus, Histories, etc.). In the case where the context 
uses implicit time, we use another property type (contextImplicitTime) to 
specify the implicit time. Using a reified version of the Situation Calculus 
to represent time, the following axiom defines these property types. 

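\[
\begin{aligned}
& \mathit{ist}(c_i,\ \mathit{contextTemporalModel}(c_i, \mathit{ImplicitStaticTime}) \;\land \\
& \qquad \mathit{contextImplicitTime}(c_i, t_i)) \;\land \\
& \mathit{ist}(c_j,\ \mathit{targetFilter}(lr, tc) \land \mathit{contextTemporalModel}(c_j, \mathit{SitCalModel}) \;\land \\
& \qquad \mathit{type}(lr, \mathit{SitCalLiftingRule}) \land \mathit{propFilter}(lr, p) \land \mathit{sourceFilter}(lr, sc)) \;\land \\
& \mathit{targetContext}(lr, c_i) \land \mathit{sourceContext}(lr, c_j) \;\land \\
& \mathit{ist}(c_j,\ \mathit{type}(x, sc) \land \mathit{type}(y, tc) \land p(x, y)) \\
& \rightarrow\ \mathit{ist}(c_i,\ (\exists z)\, \mathit{type}(z, \mathit{SitProp}) \land \mathit{time}(z, t_i) \;\land \\
& \qquad \mathit{sitProp}(z, p) \land \mathit{sitSource}(z, x) \land \mathit{sitTarget}(z, y))
\end{aligned}
\tag{6}
\]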

A more general solution is to identify common transformation patterns (as 
opposed to particular phenomena) and introduce vocabulary to handle 
these. For example, a common pattern involves reifying a particular triple 
to make something explicit. Examples of this include making time and 
propositional attitudes explicit. Another common pattern involves transforming a 
literal into a resource. A common example of this is to make the unit of 
measure or language explicit. 
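
The reification pattern behind axiom (6) can be sketched as follows: each matching source triple becomes a new node carrying the property, its two arguments and the time that the source left implicit. The function name, the blank-node naming scheme and the sample data are assumptions; the class filters of the axiom are omitted for brevity.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, property, object)


def reify_with_time(source: Set[Triple], prop: str,
                    implicit_time: str) -> List[Triple]:
    """Reify each prop(x, y) as a SitProp node stamped with a time."""
    lifted: List[Triple] = []
    # Sort so that generated node names are deterministic.
    matches = sorted(t for t in source if t[1] == prop)
    for n, (s, p, o) in enumerate(matches, start=1):
        z = f"_:sit{n}"  # fresh node playing the role of z in the axiom
        lifted += [
            (z, "type", "SitProp"),
            (z, "time", implicit_time),
            (z, "sitProp", p),
            (z, "sitSource", s),
            (z, "sitTarget", o),
        ]
    return lifted


# Made-up source triple whose time of validity is implicit.
source = {("ex:CountryA", "population", "1000")}
lifted = reify_with_time(source, "population", implicit_time="2004")
```

The target graph has five triples per source triple, which illustrates why it is no longer isomorphic to the source graph.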

Default Lifting: The constructs described above are monotonic. In practice, 
it is often convenient to be able to say that all of the contents of one context 
should be included in the aggregate context without any modification unless 
one of the other lifting rules applies. To do this, we introduce a property type 
analogous to importsFrom, called defaultImportsFrom, that specifies this. 
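
One way to read default lifting is sketched below: triples are imported unchanged unless another lifting rule already handles them. Deciding that a rule "applies" by looking only at the property of a triple is a simplifying assumption, as are the function name, the priceWithTax term and the sample data.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]  # (subject, property, object)


def default_import(source: Set[Triple],
                   handled_props: Set[str]) -> Set[Triple]:
    """Import every triple whose property no other lifting rule handles."""
    return {t for t in source if t[1] not in handled_props}


# Hypothetical source; a term mapping rule is assumed to handle "price".
source = {("ex:Item1", "price", "40"),
          ("ex:Item1", "title", "Example Title")}
mapped = {(s, "priceWithTax", o) for (s, p, o) in source if p == "price"}
aggregate = mapped | default_import(source, handled_props={"price"})
```

Adding a further lifting rule for title would withdraw the unmodified title triple from the aggregate, which is the non-monotonic behaviour this construct introduces.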

While not exhaustive, we believe this vocabulary and associated set of lifting 
rules are sufficient to solve many issues that arise in data aggregation on the 
Semantic Web. More importantly, this functionality can be incorporated into the 
Semantic Web with fairly small and simple additions to the existing standards. 

6 Model Theory 

In the last section we discussed some potential alternatives for introducing con- 
texts into the Semantic Web. In this section, we discuss issues related to contexts 




